“What will the performance be like?”

Despite the documentation, blog posts and experiences, you’re just going to have to test it…

AWS offers a dazzling array of services for similar things. Dazzling is a great synonym for “confusing”. It feels like “Speed of Innovation”, “Range of Services” and “Coherence of Offerings” are a “pick 2 of 3” situation…

An often-asked question in the AWS communities I hang out in is “can we use a Lambda for that?”

People asking this are often primed with some of the problems of serverless compute, and jump to the advanced features to mitigate them:

  • “I’m worried about cold starts, do I need to pay for provisioned concurrency?”
  • “I’m accessing a database, do I need to add an RDS Proxy?”
  • “I worry about response time, do I need to enable API Gateway caching?”
  • “I worry about out-of-order delivery, do I need First-In First-Out queues?”1

These are valid questions, with valid solutions, but paraphrasing a global sportswear brand: Just Try It.

Deploy your code, and see how it performs/responds. Then apply the tricks to make it better.

Starting with the dirtiest hack of all: Make it bigger.

Lambda CPU is only controllable via the memory setting, and I’ve been in multiple situations where there was a step-change in performance jumping from 128 to 256 megabytes. There’s tooling that can help you find the right size for your application.

It’s lazy, brute-force, but it takes literally seconds to change the memory and re-test.
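
If you’re deploying with the CDK, the dirtiest hack really is a one-property change. A minimal sketch (the runtime, handler and asset path are placeholders, and the “tooling” alluded to above would be something like the open-source AWS Lambda Power Tuning state machine, which tries a range of sizes for you):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

export class WorkerStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    new lambda.Function(this, 'WorkerFn', {
      runtime: lambda.Runtime.NODEJS_18_X,   // placeholder runtime
      handler: 'index.handler',              // placeholder handler
      code: lambda.Code.fromAsset('lambda'), // placeholder asset path
      // The knob: CPU (and network) scale with memory, so doubling this
      // often buys a disproportionate latency improvement.
      memorySize: 1024, // was 256 - change, deploy, re-test, compare
    });
  }
}
```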

I’m a big proponent of ‘innovation as Business as Usual’: you shouldn’t need the excuse of an innovation day to try new things. That only works if you start simple.

But I’m an even bigger proponent of not stressing yourself or your teams unnecessarily: if you don’t have time to test in your current project, do what you know works. Hit your shipping dates, have an easier sprint.

Hopefully you can have a play in the future and add something else to the list of what “works”. It’s nice to expand that list, but only when you’ve enough headroom to do so.

In conclusion, while sometimes I can tell you that something is unlikely to work, most of the time you’re just going to have to try it.

Sorry.

  1. I understand FIFO messaging has its place, but I work in content, not transactions. Including a ‘reliable’ timestamp in the event (e.g. the time an article was published, not the time the message was sent) is cheaper and lets consumers decide whether they should process an event, which also makes recovery easier if databases have to be re-crawled (sketched below). ↩︎
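
A minimal sketch of that footnote’s idea – the event shape and the in-memory store are hypothetical stand-ins for whatever your events and database actually look like:

```typescript
// Hypothetical consumer: instead of relying on FIFO ordering, compare the
// 'reliable' timestamp carried in the event against the last one we stored.
interface ArticleEvent {
  articleId: string;
  publishedAt: string; // ISO-8601, set when the article changed, not when the message was sent
}

// Placeholder store: in reality this would be DynamoDB, a database row, etc.
const lastSeen = new Map<string, string>();

function shouldProcess(event: ArticleEvent): boolean {
  const previous = lastSeen.get(event.articleId);
  // Process only if this event is newer than anything already handled; stale
  // or duplicate events (e.g. from a re-crawl) are safely ignored.
  // ISO-8601 strings in the same format compare correctly as plain strings.
  if (previous && previous >= event.publishedAt) {
    return false;
  }
  lastSeen.set(event.articleId, event.publishedAt);
  return true;
}
```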

Systems are designed in a context

I’ve been following a bit of Internet Of Things drama because a company sent a Cease & Desist to a developer who was polling their API for an unofficial Home Assistant integration.

Thankfully, it looks like the company is now engaging with the developer and the Home Assistant team, so maybe it can be resolved.

But one of the recurring comments initially was “if they’re saying it costs too much money, they should use a cheaper host than AWS”.

Disclosure: my career is helping companies use the right bits of AWS well, and for all my discomfort at the centralisation of the internet into the Hyperscalers – the ability to deploy a website that costs nothing if nobody uses it, yet scales to match demand, is pretty compelling.

“What Context do you mean Gareth?”

Backing up a second, this integration polls the endpoint pretty aggressively, and so could be causing a noticeable spike in API calls, and in the costs experienced by Haier.

If there were about 500 active installs, that could cost something between 500 and 1,500 USD per month, based on the polling rates. Probably immaterial in the grand scheme of things, but noticeable.
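
For a rough sense of how polling gets you to numbers like that, here’s a back-of-envelope sketch – the install count, polling interval and per-request price are all my assumptions, not Haier’s real figures:

```typescript
// All numbers are illustrative assumptions, not anyone's actual figures.
const installs = 500;
const pollsPerMinute = 12;            // e.g. one poll every 5 seconds
const requestsPerMonth = installs * pollsPerMinute * 60 * 24 * 30; // ~259 million
const pricePerMillionRequests = 3.5;  // roughly API Gateway REST API territory, USD
const monthlyCost = (requestsPerMonth / 1e6) * pricePerMillionRequests;
console.log(`~${requestsPerMonth.toLocaleString()} requests ~= $${monthlyCost.toFixed(0)}/month`);
// ~ $907/month, before any Lambda, database or data-transfer costs sitting behind it.
```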

The context I’m talking about is that they built this API knowing it wasn’t going to be called that much… many of the interactions customers saw would be driven by push notifications or other event-driven things. It’s only called by the app/website when people visit… it was never costed for being called continuously.

They didn’t optimise that cost, because it didn’t make sense to.

But still, why aren’t they doing it cheaper?

Bluntly because “total cost of operation” (TCO) is more than just the compute.

Sure, I could just “spin up a VM and do it all myself”, but then I’m doing it all myself. I’m suddenly in the realm of patching. I’ve got to do my own auth.

If AWS are terminating your HTTPS endpoint, then AWS are on the hook for the latest HTTP desynchronisation exploit.

When you host stuff on managed services (from any provider) you’re really going hands-off. You don’t need an ops team because, ultimately, there generally isn’t anything you can operate… If there’s an outage, often you just have to wait for the provider to fix it.

That might feel disempowering, but equally, there’s no point having an on-call team and waking someone at 3am to go “Yup, it’s fucked” while being unable to actually fix it.

In summary

There’s usually a cheaper way to deploy things.

The question is “is it actually cheaper, everything else considered?”

Don’t try to second-guess other people’s architecture, and if you are going to poll an API, try to do it considerately…

Save Money, deploy IPv6 in your VPC

Getting past uncertainty or unwillingness to use IPv6 is actively costing you money.

I (mostly) have IPv6 deployed in my home network – alas, a bug on my router currently prevents it working completely – but for years it’s been enabled, mostly as a novelty, to feel like I was ready for the grand future, without really noticing any differences.

Last week however, I had a chat that made me realise that using IPv6 is now a cost-saving measure… and it’s maybe time to get over our resistance to do it by default.

The Joy of NAT

Typically you deploy an AWS VPC using internal IPv4 addresses, and (if you’re as allergic to avoidable self-managed services as I am) a managed NAT gateway.

As with nearly everything AWS, there’s an hourly cost, and also charges for the volume of data transferred.

In an AWS group I’m a member of, someone asked “We’re paying a lot for NAT, what can we do?” – and I said “I mean, you could try the IPv6 Egress Gateway, but I dunno if the APIs you’re using support IPv6”.

The egress-only IPv6 gateway charges only ‘standard’ egress rates and has no rental cost. I’ve been aware of its existence for years, but have never deployed it or had reason to.

I was expecting to be told “Only one of my APIs supports IPv6”, but the person reported back “Actually nearly all my APIs are on IPv6, but I can’t deploy the IPv6 easily because of <reasons>”.

This was not what I was expecting, despite remembering that for the last few years, wherever I could deploy a dual-stack endpoint, I have been…

So why haven’t we?

There are a few reasons why we haven’t been deploying IPv6 more routinely, and we should try to change these:

  1. IPv6 didn’t have an obvious advantage. In most cases the IPv4 setup works OK… so what’s the point of adding it to a working setup?
  2. The deployment tooling doesn’t support it – this was the case here: the person was using the lovely AWS CDK to deploy their VPC, and the current VPC ‘construct’ doesn’t support IPv6 easily (see the sketch after this list)
  3. We like NAT’s security by default: with NAT, your compute resources aren’t exposed to the internet, and all that endpoints see are one or two shared IPs… There’s no way to accidentally open inbound connections, and that security by default is pleasing. Even though the IPv6 egress-only gateway doesn’t allow inbound connections either, you do feel more exposed, and so much of security is an annoying vibes-type thing
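
To be fair to the CDK, you can still bolt IPv6 onto the standard Vpc construct by dropping down to the CloudFormation layer. This is only a sketch of that escape hatch (the construct may well have grown proper support since; the IDs and the /56-into-/64 maths are assumptions you’d adjust):

```typescript
import { Fn } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Assumes `vpc` came from the standard ec2.Vpc construct.
declare const vpc: ec2.Vpc;

// 1. Ask AWS for an IPv6 block on the VPC.
const ipv6Block = new ec2.CfnVPCCidrBlock(vpc, 'Ipv6Block', {
  vpcId: vpc.vpcId,
  amazonProvidedIpv6CidrBlock: true,
});

// 2. The egress-only internet gateway: outbound-only, and no hourly rental.
const egressOnlyGateway = new ec2.CfnEgressOnlyInternetGateway(vpc, 'EgressOnlyIgw', {
  vpcId: vpc.vpcId,
});

// 3. Carve a /64 per private subnet and point ::/0 at the gateway.
const cfnVpc = vpc.node.defaultChild as ec2.CfnVPC;
vpc.privateSubnets.forEach((subnet, index) => {
  const cfnSubnet = subnet.node.defaultChild as ec2.CfnSubnet;
  cfnSubnet.ipv6CidrBlock = Fn.select(
    index,
    Fn.cidr(Fn.select(0, cfnVpc.attrIpv6CidrBlocks), 256, '64'),
  );
  cfnSubnet.assignIpv6AddressOnCreation = true;
  cfnSubnet.addDependency(ipv6Block);

  new ec2.CfnRoute(subnet, 'Ipv6DefaultRoute', {
    routeTableId: subnet.routeTable.routeTableId,
    destinationIpv6CidrBlock: '::/0',
    egressOnlyInternetGatewayId: egressOnlyGateway.ref,
  });
});
```

None of this touches the NAT gateway; traffic to IPv6-capable destinations simply stops flowing through it.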

Some steps to take

It’s sometimes easy to forget that we have run out of IPv4 addresses, and that we’re muddling through with Carrier Grade NAT and other annoying technologies that work, but make everyone’s lives just that little bit worse… But we can improve this incrementally, and anyone who’s worked with me knows I love making things better gradually.

  1. Enable IPv6 on any managed services you use, by default. If you’re running CDN hosts on CloudFront, or an API on API Gateway, those can all run IPv6, and they run it outside of your VPC. There is no security risk to you, and by enabling it you give other people the benefit (as sketched after this list)
  2. Enable IPv6 in your VPC, but just on your public subnets. If adding IPv6 makes you nervous, start with ‘just’ doing it on your public subnets, and use it to give your load balancers IPv6 addresses. This makes life better for other people, and gives you learning for the next step
  3. Enable IPv6 with an egress-only gateway for any subnets that have NAT: this is where you start saving money, as hopefully you’re NAT’ing less traffic and saving on those charges
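
Step 1 is often a one-liner. A sketch for a CDK-managed CloudFront distribution (the origin is a placeholder, and I believe enableIpv6 is already the default on this construct – the point is just to check it’s on):

```typescript
import { Stack } from 'aws-cdk-lib';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';

declare const stack: Stack; // your existing stack

// The origin domain is a placeholder; enableIpv6 is the only line that matters here.
new cloudfront.Distribution(stack, 'Cdn', {
  defaultBehavior: { origin: new origins.HttpOrigin('origin.example.com') },
  enableIpv6: true, // serve a dual-stack edge without touching anything inside your VPC
});
```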

Not every use-case will save money, I haven’t generally had the patterns of bandwidth usage that would benefit from this… but if you’re using a lot of NAT bandwidth, maybe it’s time to look at IPv6 as a potential cost-saving.

About “That” Prime Engineering article

Everyone who has worked with managed cloud services has experienced the moment when it made sense to move away from managed services.

Turns out, so does Prime Video.

Amazon Prime Video recently wrote about how moving away from managed services and writing a more integrated application saved them money. Despite being a few months old, the post appeared to blow up this week, and predictably has caused some cries of “SEE, SEE, YOU SHOULD JUST RUN EVERYTHING YOURSELF”.

But to those of us who have been building on AWS (and other providers) for many years, it’s not a surprise, and we all have stories where we’ve done similar.

I say this as someone who is an annoying cheerleader for serverless compute and managed services, but despite that, I have home-rolled things, when it made sense.

How do you solve a problem?

When you’re solving a problem, you look at what managed services you have available, considering factors like:

  • Your team’s experience with the service
  • Limitations of the service, and what it was intended for, versus what you’re doing
  • What quotas apply that you might hit
  • How the pricing model works

While pricing for managed services is generally based on usage, sometimes specific costs will apply more to your workload, e.g. if you’re serving small images, you’ll be more impacted by the per-request cost than the bandwidth charges.
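
As a toy example of that small-images case (the unit prices are illustrative, CloudFront-ish numbers – check the real rate card before drawing conclusions):

```typescript
// Illustrative unit prices in USD - not a real rate card.
const pricePerGb = 0.085;          // data transfer out
const pricePer10kRequests = 0.01;  // HTTPS requests

const imageSizeKb = 5;             // tiny thumbnails
const requestsPerMonth = 100_000_000;

const bandwidthCost = (requestsPerMonth * imageSizeKb / 1_000_000) * pricePerGb; // ~ $42
const requestCost = (requestsPerMonth / 10_000) * pricePer10kRequests;           // ~ $100
// For tiny objects the per-request charge dominates; for large video files it flips.
```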

I would be surprised if an experienced architect hasn’t faced a situation where “Service X would be perfect for this, if only it didn’t have this restriction Y, or wasn’t priced for Z”.

My example

We’d built out a system that was performing 3 distinct processing steps on large files.

The system had been built out incrementally, and we had the three steps on three different auto-scaling groups, fed by queues.

While some of the requests could be processed from S3 as a stream, one task required downloading the file to a filesystem, and that download took time.

The users wanted to reduce the end-to-end processing time. Some of the tasks were predicated on passing prior steps, and so we didn’t want to make the steps parallel.

Attempt 1: “EFS looks good”

We used the ‘relatively’ new Elastic File System service from AWS… The first component downloaded the file; subsequent operations used this download.

This also had the advantage that, since the ‘smallest’ service was first, you paid for that download on the cheapest instance, and the more expensive instances didn’t have to download it.

We developed, deployed, and for the first morning it was really quick… until we discovered that we were using burst quota, and spent the afternoon rolling back.

Filesystem throughput was allocated based on the amount stored on the filesystem, but as our files were transient, we didn’t earn credits back quickly enough, and we didn’t like the idea of just making large random files to earn them.

Now you can just pay for provisioned throughput, perhaps in a small part because of a conversation we had with the account managers.
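
If you were building it again today (with the CDK, which we didn’t have at the time), the fix is a couple of properties – the throughput figure is a placeholder you’d size to your actual transfer rates:

```typescript
import { Size, Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as efs from 'aws-cdk-lib/aws-efs';

declare const stack: Stack;   // your existing stack
declare const vpc: ec2.IVpc;  // your existing VPC

// The option that didn't exist when we rolled back: a fixed throughput floor,
// independent of how many bytes happen to be sitting on the filesystem.
new efs.FileSystem(stack, 'SharedScratch', {
  vpc,
  throughputMode: efs.ThroughputMode.PROVISIONED,
  provisionedThroughputPerSecond: Size.mebibytes(128), // size to your file-transfer needs
});
```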

Attempt 2: “Sod it, just combine them”

The processes varied in complexity: there was effectively a trivial, a medium, and a high-complexity task… So the second solution we approached was combining all the tasks onto a single service… the computing power sized for the highest-complexity task would zoom through the other two, and so we combined them into what I jokingly called “the microlith”.

We didn’t touch the other parts of the pipeline or the database; they remained in other services. But combining the three steps worked.

What did we gain?

The system was faster and, even more usefully for operators, more predictable.

Once processing had started you could tell, based on the file size, when the items would be ready…

Much like “lower p90 but higher maximum” feels better for user experience, this consistency was great.

What did we lose?

Two of the three components had external dependencies, which meant the combined component was one of the less ‘safe’ things to deploy, and while tests were built up to defend against that… the impact of a failed deploy was larger than you’d want.

In Conclusion

There are always times when breaking your patterns makes sense; the key is knowing what you’re both gaining and losing, and taking decisions for the right reasons at the right times.

Prime Video refining an architecture to better meet scaling and cost models, making it less “pure”, isn’t the gotcha against these services that some people would have you believe.

“Pure” design doesn’t win prizes.

Suitable design does.

You make me install it… you tell me how to remove it

Having installed some invasive ‘online proctoring’ software, I tried to ask the company how to uninstall it.

I was, finally, getting around to doing AWS certifications, and one of the ways you can do that is via the Pearson OnVue proctoring system. You run software that limits what you can use your machine for, and stops you using multiple windows to look up answers. Unsurprisingly, it asks for a fair bit of access when you run it.

I installed it, I ran the test to see if it worked, and shortly afterwards was having problems with my computer and the clipboard – and wondered if that was a ‘feature’ of this software.

I have since resolved these separately, but I still wanted to uninstall OnVue, and there is NO detail of how to do this on the website.

I have asked all their contact channels “how do I uninstall it?” – and I get back a variety of canned responses:

  • “you can’t cancel an exam via email”
  • “if you run a test exam you can see if the software works on your system”

Now, given it doesn’t look like it was an “installed” app, just an application that ran from a zip file, uninstalling it could be as simple as never running it again and deleting the download.

But I don’t know whether it has installed a few system extensions or similar. I’ve had a quick look and I don’t think it has, but is it too much to expect a webpage owned by Pearson that says so? Right now the search results for “remove onvue macintosh” are swamped by advertising pages for Mac removal software.

A document online that states what to do, and an answer in the knowledge base so agents can respond, is really the bare minimum.

If you make me run/install it, give me a clear way to remove it.

The one in which I learned to build a chat bot, and a bit about how I learn

Why revisiting known problems can be a boring but reliable way to learn, and how to think about that for Hackdays

I’ve two problems I keep coming back to and implementing (or attempting to) every now and again.

First is my home-grown voicemail system – not that anyone actually makes standard telephone calls to me anymore, but after a series of “voicemail with dictation” providers left the market, I rolled something together with Twilio and App Engine.

The other is the data feed for the TfL London bike hire scheme. I have failed to buy myself a bike and still rely on the hire ones, and before app coverage was good, I had a webpage that could show me the docks around where I live.

These have both been revisited over the years. The voicemail system moved from Google App Engine, where it returned a complete web page (remember those?), to running on AWS as a Single Page App, listening to an MQTT change feed and connecting to AWS services to retrieve data for the React app to update on screen.

Meanwhile, although the original Bike webpage went stale and stopped working, last year I started writing an application in SwiftUI that would display the data. Thanks to changes in iOS/Xcode that project can’t even run to get a screenshot anymore…

I’ve long wanted to be able to text something and get a reply about the bike docks nearby. Because I’m an iOS user, the Siri intents available to me are very limited, and interacting by text message is easier with voice – this would keep my eyes on the road.

So, after an idea just before bed, over the last few days I’ve created a chatbot, and I can ask it questions about the Hire bikes.

But why do I keep revisiting these problems?

TL;DR They’re boring problems that I understand the domain of.

I learn by doing.

My Professional Development Plan would be summarised as Continually Feeling Just A Little Out Of My Depth and managing to keep up.

It’s very much “I learn this stuff because I HAVE to learn it”.

Outside of that directed work frenzy, I have limited windows for learning – periods when I feel interested and able to commit some time to it. I’ve found that I can generally learn one of two things:

  1. Approach: something new about an existing domain I know, e.g. using a new language, web framework or API to solve a problem I already broadly know.
  2. Domain: learning new problem domains with existing technologies, e.g. building out a new website to use a new API – technologies I understand, in areas I haven’t previously worked.

When I try to do both at once, I quickly get frustrated and quit. I’m not a full-time coder; I architect systems and work with teams to coach them into building things, and I’m not best served building things myself, even though I love to work with those who are building.

When I’m working, the need to get things done powers me through any frustration walls (mostly). But when I’m doing stuff for ‘fun’, that doesn’t happen as much, so I try to only do one of the two things.

So, What does BikeBot do and What Did I Learn?

I can ask BikeBot for the details of a specific hire station if I know the name, which is useful if I’m heading to a place I know well, and just want to get details of bikes or docks that are available.

I can also ask BikeBot for the dock nearest to a point of interest, when I don’t know what the station is called.

BikeBot then returns the data from the TfL API, and it’s accessible over the phone with speech recognition, or could be available by SMS if I configure an integration.
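
The non-chatbot half of that is pleasantly small. Here’s a sketch of the lookup against the real TfL BikePoint endpoint – the field names (commonName, NbBikes, NbEmptyDocks) are what the payload uses today, but treat the shapes as my assumptions rather than a guaranteed contract:

```typescript
// Sketch: fetch every BikePoint from TfL and find the closest one to a lat/lon.
interface BikePoint {
  commonName: string;
  lat: number;
  lon: number;
  additionalProperties: { key: string; value: string }[];
}

// Haversine great-circle distance in kilometres.
function distanceKm(aLat: number, aLon: number, bLat: number, bLon: number): number {
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(bLat - aLat);
  const dLon = toRad(bLon - aLon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(aLat)) * Math.cos(toRad(bLat)) * Math.sin(dLon / 2) ** 2;
  return 6371 * 2 * Math.asin(Math.sqrt(h));
}

async function nearestDock(lat: number, lon: number): Promise<string> {
  const res = await fetch('https://api.tfl.gov.uk/BikePoint');
  const points: BikePoint[] = await res.json();
  const nearest = points.reduce((best, p) =>
    distanceKm(lat, lon, p.lat, p.lon) < distanceKm(lat, lon, best.lat, best.lon) ? p : best);
  const prop = (key: string) =>
    nearest.additionalProperties.find((p) => p.key === key)?.value ?? '?';
  return `${nearest.commonName}: ${prop('NbBikes')} bikes, ${prop('NbEmptyDocks')} empty docks`;
}

// e.g. nearestDock(51.5074, -0.1278).then(console.log);
```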

So, by revisiting a familiar problem, I got to learn:

  • Amazon Lex, the chatbot tool
  • Amazon Location Service, the tool I use to geocode “I’m at this Place give me dock”
  • Amazon Connect, the contact-centre product that makes it accessible by voice over the phone

I now have a neat demo which, with a little more work, will give me something I can use myself.

I also re-learned just how easy managed services can make solving problems, because so much of the heavy lifting is done for you. I would never find the time to do fuzzy matching of station names to user input, but if I give Lex that list, it’s done for me. Not perfectly, but orders of magnitude better and faster than I’d ever be able to do it myself.

“erm, cool story bruh, but how does this help me?”

If you’re running hackdays, think about the ideas that teams have, and whether the teams are capable of learning both 1 & 2 (a new approach and a new domain) at the same time.

Some companies frown on “doing things to do with the day job” at hackdays, really wanting more blue-sky, out-there ideas, but maybe your team aren’t really up for that. Or if they are, they need a bit more planning, so teams and ideas are sketched out ahead of the hackday, along with any of the pre-requisites needed to make progress quicker.

AWS Launches MediaConnect and almost gives us multicast

It’s Re:invent time, and Amazon have launched a new service to make video routing to the cloud reliable and easier to set up.

A few weeks back I was at the brilliant DPP Leaders Summit, which was held under the Chatham House Rule.1 There were some great speakers, and I particularly loved the exec who said, to paraphrase, “If it doesn’t work without months of professional services, THEN IT ISN’T AN ACTUAL PRODUCT.”2

Anyway, one of the speakers was facing rebuilding their entire stack due to ownership changes, and wanted to do so in the cloud. They said “We need multicast and Precision Time Protocol”, which I can understand: for playout or production applications, the need for those two is pretty clear.

It’s now Re:invent season, which is the point in the year when AWS tend to release a lot of their good stuff. And yesterday they unveiled a new media ingest service, AWS Elemental MediaConnect.

It’s a managed service to get your video signals to/from/between your Amazon clouds.

This has historically been a pain: back when I was working on the Video Factory project we initially mooted a box in the cloud that we would send the signal to, which would then fan out to both archiving and live streaming. This was hard to do, so we side-stepped the issue, and just rapidly uploaded the stream to S3 in consistently sized chunks instead. Later something was put in place to do the streaming, using something that I don’t think has been spoken about too much in public, so I shan’t detail it here.

Anyway, this new service allows you to send content to/from an endpoint using standard RTP (with or without Forward Error Correction) or the more reliable but commercial Zixi protocol. Each flow has an Amazon ARN identifier, which means that external accounts can be given permission to subscribe to the stream; the documentation says a ‘flow’ can have up to 20 outputs.
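
A rough sketch of what driving it from code might look like, using the current JavaScript SDK – the names, addresses and ports are placeholders, and I’m assuming the CreateFlow/AddFlowOutputs operations behave as the docs describe:

```typescript
import {
  MediaConnectClient,
  CreateFlowCommand,
  AddFlowOutputsCommand,
} from '@aws-sdk/client-mediaconnect';

const client = new MediaConnectClient({ region: 'eu-west-1' });

async function createContributionFlow(): Promise<void> {
  // A flow ingesting RTP with forward error correction from one whitelisted encoder.
  const { Flow } = await client.send(new CreateFlowCommand({
    Name: 'studio-contribution',            // placeholder
    Source: {
      Name: 'ob-truck-feed',                // placeholder
      Protocol: 'rtp-fec',
      IngestPort: 5000,
      WhitelistCidr: '203.0.113.10/32',     // the only address allowed to push to us
    },
  }));

  // Fan out: each flow can feed multiple outputs (up to 20, per the docs),
  // and other accounts can be granted access via the flow's ARN.
  await client.send(new AddFlowOutputsCommand({
    FlowArn: Flow!.FlowArn!,
    Outputs: [{
      Name: 'to-streaming-encoder',         // placeholder
      Protocol: 'rtp-fec',
      Destination: '198.51.100.20',         // placeholder downstream receiver
      Port: 5010,
    }],
  }));
}
```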

How are we going to use this?

  1. Contribution to streaming output: fire the video somewhere and you don’t have to know if/where it’s being used
  2. Contribution for programming: using a few Amazon regions, broadcasters could very easily build a global contribution network to backhaul outside broadcasts
  3. Contribution from a playout appliance: if your cloud playout outputs to a MediaConnect flow, then you can output that flow to your broader distribution chain, allowing re-routing of things downstream.

It isn’t multicast within a VPC, and it’s not PTP. I suspect the latency involved may be too great to allow it to be used to route between different stages in a virtual playout chain.3

MediaConnect does however simplify integrating cloud processing workflows by providing fixed points at the edges in and out of the cloud.

I’ll be interested to see how people use it.

  1. That it is a singular rule is one of those bits of pedantry I cannot let go of
  2. This is probably a topic for another time, but the fact that so many enterprise vendors expect you to pay for their ‘product’ then explain that ‘oh, no, you can’t just use it out of the box even in a basic manner’ is a bit of a joke
  3. I could be very wrong here, I don’t have one of those hanging around to test

Anatomy of Ticketing ‘Fail’

Having failed to get tickets for something in an annoying day of pressing reload, I try to write something constructive about scaling for big things

(Or what happens when a company that isn’t eventbrite tries to be eventbrite)

A friend wanted to book some tickets for an event. I had some time today, so I said I’d book them.

For reasons of politeness, I’m not going to name the company. The event was massively over-subscribed, and there were always going to be people who were annoyed (kinda like the Olympics). I’m just annoyed because I saw things done in a generally ad-hoc way, and specific technical bugs hit me.

Tickets were delivered in tranches. This is a sure sign there will be massive peaks in demand…

The hour arrived and, in an all too typical scenario, www.example.com rapidly stopped responding.

A few minutes later everything returned 403s as they killed all access to the server to get the load down. Not great, but it’s a sign somebody is looking at the problem.

example.com then starts redirecting to http://xxxxxxxxxxxxx.cloudfront.net/url1 with all the individual ticket pages iFramed through to an Amazon EC2 instance (http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com/blah)

The IDs for the events were sequential, some had already been released, and you started to think that people had been gaming the system and ordering tickets before their availability windows. This was later denied by the company (which I accept), but given the way the scaling was going, at the time it was all too easy to think they were using security by obscurity to prevent access to the events.

Later in the day, when tickets appeared, it was announced via a tweet. The tweet though didn’t link to the site, but to a mailing list post, which again didn’t reference the actual site.

The site had now changed: example.com was redirecting to http://xxxxxxxxxxxxx.cloudfront.net/url2, again passing through to an EC2 instance. Later, many people complained on Facebook that they were looking at the old page and pressing reload.

Anyway, I tapped. I was at the gym. I was on my iPhone. But I know my credit card number, I know my PayPal password, I can even use that tiny keyboard. I’ve topped my Starbucks card up while in a queue. I can do this.

There was even a mobile site.

Only the mobile site was erroring because it was asking for a non-existent field/table. I had no way to change my user-agent (and wouldn’t have trusted Opera with my credentials), and in the 10 minutes it took me to get back to my laptop all the tickets had been sold.

No tickets for me+mate. Grumbly me having seen things done badly.

As many will say, this is not life and death – but example.com is primarily not a ticketing company, and that showed today.

If you’re going to compete with the likes of eventbrite, you’re going to have to be as good as eventbrite.

The Constructive “What can we learn” Bit

1. Believe it could happen, no matter how unbelievable.

Ask yourself “if we get off-the-scale load, how will we fix it?”. Working out volumetrics and scaling is hard, so alongside your “reasonable” load calculations of “we can turn off these bits of our site”, have your plan for how you’ll move to something big, cloudy and scalable if the unbelievable happens.

Are there components that you should move upfront? You have something like 15 to 30 minutes of goodwill. What do you need to do upfront so that, within that downtime, you can come back up fully scaled?

If you’re looking at a scalable, elastic thing, look at how much it costs to just start in that state anyway.

2. Architect things to give you agility

If you can’t host all your website on a scalable platform: Subdomains, DNS expiry times and proxy-passes give you room to move, but only if set up ahead of time.

Had tickets.example.com been available, example.com wouldn’t have had to disappear as it has done until tomorrow. You don’t want your website down for that length of time.

DNS changes can take time – much less if you dial down the expiry times, but again you have to do that prior to the event. Amazon’s Route 53 is cheap, so move the domains ahead of time and set Times To Live appropriately.
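
With today’s SDK, dialling the TTL down ahead of time is a few lines (at the time you’d have clicked it in the console). The zone ID and addresses below are placeholders; the only interesting line is the short TTL:

```typescript
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from '@aws-sdk/client-route-53';

const route53 = new Route53Client({});

// Set well before the event, so a later repoint propagates in minutes, not hours.
await route53.send(new ChangeResourceRecordSetsCommand({
  HostedZoneId: 'Z0000000000000',          // placeholder hosted zone
  ChangeBatch: {
    Changes: [{
      Action: 'UPSERT',
      ResourceRecordSet: {
        Name: 'tickets.example.com',
        Type: 'A',
        TTL: 60,                           // one minute: cheap insurance
        ResourceRecords: [{ Value: '203.0.113.50' }], // placeholder target
      },
    }],
  },
}));
```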

While you’re waiting for that propagation, proxy-passing can be a useful technique to bounce the traffic to the new server. Proxy-passing also means that example.com/tickets could have been redirected, rather than an entire domain.

Are you caching what you can at an HTTP level with Varnish, or at a service level with memcached?

3. Be careful sending people onto new URLs that won’t update

Taking the ticketing system off their main website was a good move, but the static page should have remained there. The second they redirected to CloudFront, people were then looking at a page that would go stale.

Many people would have pressed reload, expecting it to appear, but it didn’t because, as you can see above, the URL changed. They could have used the CloudFront invalidation API, but this wasn’t used.

4. Remember data protection issues

This company used the Virginia data centre (which I think is the AWS default). Without going into the whole world of pain that is data protection and EU borders – Dublin would have had lower latency and been less problematic compliance-wise.

5. Testing is good, as is automatic deployment

There were not many tickets and the load was huge; those things were not avoidable. I can’t say the same about the erroring mobile site – that should not have occurred.

6. Rehearse

It’s not fun doing disaster recovery, but if you’re receiving catastrophic load then that is what you’re doing.

Write the script. Have someone else test it.

It’s not a valid plan until you have shown it works.