On THAT Excel Issue

What can we actually learn from the government’s Excel-related issues?

There have been many comments posted in the last week about “excelgate”, or whatever we want to call a life-threatening data-exchange problem. This post is not about absolving the government of blame for this, or for the countless failings they’ve made across the Test & Trace programme. Between the app that everyone who understood iOS Bluetooth told them wouldn’t work, and giving the bulk of Contact Tracing to private companies rather than local health teams… I’m really not excusing them.

But, I do think there are more nuanced lessons that can be learned beyond “LOL WOT THESE N00B5 USING M$ EXCEL. Y U NO PYTHON?”, which is an exaggeration, but not by much, of some of what I’ve seen online.

I’m writing this based on the following assumptions/guesses: data had to get from one system to another, .xls rather than .xlsx was used, and this hit a row limit. (This really should have been an automated feed, but that’s not what I want to explore here; I want to explore how organisations can prevent people from doing ‘good’ things.)
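
To make the failure mode concrete: the legacy .xls format tops out at 65,536 rows, so anything beyond that can simply be dropped on export if nothing checks. A minimal sketch of the kind of guard that would surface this loudly (the file names are invented, and this is not the actual pipeline in question):

```python
# Guard against silently truncating an export to a legacy spreadsheet format.
XLS_MAX_ROWS = 65_536       # hard row limit of the legacy .xls format
XLSX_MAX_ROWS = 1_048_576   # limit of the newer .xlsx format

def check_fits(rows, path):
    """Fail loudly, before writing, if `path`'s format can't hold all the rows."""
    limit = XLS_MAX_ROWS if path.endswith(".xls") else XLSX_MAX_ROWS
    if len(rows) > limit:
        raise ValueError(
            f"{len(rows)} rows won't fit in {path} (limit {limit}); "
            "use a different format or split the file"
        )

# check_fits(case_rows, "daily_cases.xls")  # call this before the real export writes anything
```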

So, we’re using an inappropriate data transfer format, with a hard limit on how many rows it can contain… This sets up a few different scenarios:

  • Nobody foresaw this problem
  • The problem was known, but the decision was taken not to fix it
  • It was known, people wanted to fix it, but couldn’t

If we explore these, I think there are lessons we can take away for the organisations we work for or with, about how some of our anti-patterns might lead us into the same scenarios.

Nobody Foresaw This

This would be the most damning of the outcomes: a risk that nobody had realised they were living with, and, crucially, one that the software doing the export didn’t warn anyone about.

Tips to avoid it:

  • To borrow from the WHO: Testers. Testers. Testers. Hire decent testers, the ones who infuriate you with “What if this series of 3 highly improbable events happens?”
  • As we’ll come on to in a second, listen to them when they say these things.

It was known about, but decisions were taken not to fix.

These aren’t fun. Speaking as someone who once predicted a particularly nasty auto-scaling bug and tried to warn people, only to find it wasn’t accepted that it needed fixing until it actually occurred: it can always leave you feeling “if only you’d argued the case better”.

But it’s legacy…

Matt Hancock, the UK Health & Social Care Secretary, described the system as (paraphrased) “Legacy and being replaced”.

We’ve all been here: a system that is old and being replaced is considered frozen because “it’s going away”. However, I know of systems that were due for replacement in the next 6 months, yet 3+ years later development of the replacement hadn’t started. “It’s going away” was used as a reason not to make relatively trivial UX changes that could have been a great improvement for the operators.

Tips to avoid:

  • Until you unplug the server, turn off the instance or stop new data flowing into it, no system is “legacy”

“It’s very unlikely… we can live with it”

Nobody, apart from epidemiologists and software billionaires, predicted a future epidemic on this scale – so I guess that maybe the problem was known, and the decision was taken to live with it. Going back to the first recommendation of hiring a tester: sometimes so many scenarios are found that it’s easy to tune out, because, like Cassandra, the tester is always talking about problems.

Tips to avoid:

  • It’s ok not to fix everything, but if you’re living with a risk, make sure it’s known, and doesn’t fail silently.
  • Keep these risks in your risk log, actually re-read it once a quarter, and assess whether they’re now more of a problem.
  • Try to be a little less agile, at least in methodological purity, and go beyond “what we’re building next” and look a few steps ahead.

We wanted to fix it…

This is where we get into the most depressing collection of scenarios:

“You can’t just make a change, this needs a PROJECT”

Changes need to be properly developed, tested and deployed, but sometimes this doesn’t need a full project structure created. When all improvements are painful to implement, people just accept and build workarounds, some of which you may not be aware are in place.

Tips to avoid:

  • Have a lightweight process for “short-order” requests that are small.
  • Find ways to bundle these into bigger releases alongside the “im-por-tant” work.

“It’s too expensive”

If you have a bad contract with your supplier, it could just cost too much to viably fix.

Tips to avoid:

  • Only buy software/services where the API is included, and is nice to develop against (I’m looking at you, SOAP)
  • Have clear boundaries in your systems/components, and own the integrations yourself, so you can swap or combine components as required (a rough sketch of what I mean follows below)
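
On that second point, here’s a rough sketch of what owning the integration boundary can look like (all the names are invented for illustration): the rest of your system codes against an interface you define, and the supplier-specific adapter is the only thing you rewrite when you swap vendors.

```python
# Own the boundary: callers depend on this interface, not on any vendor API.
from abc import ABC, abstractmethod

class CaseStore(ABC):
    """What *we* need from a supplier, expressed in our own terms."""

    @abstractmethod
    def submit_results(self, results: list) -> None:
        ...

class VendorSoapCaseStore(CaseStore):
    """Adapter for today's supplier; swap this class, not every caller."""

    def submit_results(self, results: list) -> None:
        # Translate our model into whatever the vendor's (SOAP) API expects.
        ...
```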

“The person who develops it is too busy/gone away”

You could imagine that even if this system was modifiable, right now the people with the IT skills are elsewhere, working on the plethora of other systems that have had to be spun up to cope with the current situation.

Worse, though, is when software has gone stale: while you may have developers who could work on the problem, nobody really understands how to build or deploy it anymore, so it’s effectively stuck.

I’ve worked with clients who had problems with code going stale, and instituted a very strict “if you modify a component you must fully adapt it to be in line with our current standards” rule to fix this. However, this just introduced a disincentive to make minor improvements, because developers knew that alongside 5 lines of functional code changes, they’d have to make 500 lines of dependency-related ones.

Tips to avoid this:

  • Avoid one product/system/component being solely one person’s ‘thing’.
  • Find ways to allow people to deploy minor changes as a BAU process, gradually updating components into modern ways of working without dogmatically requiring every component to be fully updated.

In conclusion

We’ve all used Excel files or CSVs in email, or a Google Sheet, as an interim solution. The problem is that these interim solutions become permanent, and eventually they stop working. I’m lucky in that mine were about keeping TV or VOD on-air, and not about life-or-death statistical reporting processes.

But still, let’s tone down the sneering “BUT WHY WASN’T IT AUTOMATED” talk. Yes, it clearly should have been, but none of us know the decisions that were made, or the software hooks the operators/developers had access to.

Always monitor your systems, spot where things can be better, and make the incremental improvements, because they add up over time. Never invest all your hope in the new system/rewrite, because it’s always years away, and usually comes with its own new ‘quirks’.

The One Boring Reason Why People Use the AWS Service

One of my clients recently started using a relatively new AWS CI/CD Service, and I just stumbled on a defensive/marketing type post from one of the traditional providers. And it made me realise how much vendors can miss the reason people choose to go with the AWS/GCP/Azure service, even if it’s inferior.

Aside: I’m not going to link to the article because they don’t deserve the clicks.

Back to their post, it went through a familiar structure:

  1. “But it doesn’t have all the features, our lovely features”
  2. “You can’t self-host, you’re LOCKED-IN!”
  3. “Why not buy into our broader platform?”

I’ll go through these in turn, before getting to the actual reasons.

“It doesn’t have the features…”

It doesn’t. It’s version 1 of an AWS product… they always launch very lean and gain new things.

And yes, it only supports 3 integrations while Vendor supports around 30. It turns out, though, that those 3 are the most important ones. Others will be added, I’m sure, but only where people will use them.

“You can’t self-host, you’re LOCKED-IN”

Good. I literally don’t want to.

I know that some Ops-Teams feel happier when they can touch a container or an instance, but this is a product that can be replaced quite easily, including by this Vendor should the need arise.

They do have a SaaS offering you can pay for, but it’s relatively expensive for small teams. (And we’ll come on to the legal things later.)

“Why not buy into our broader platform?”

Lock-in to your cloud provider is bad, but if you use all of their products you can get a great unified experience… which sounds a little like, erm, lock-in.

The simple reason people choose the service on their Cloud… procurement

Companies generally make buying stuff difficult. Every new vendor means a new round of legal review, and potentially a procurement exercise. It’s a painful affair.

This Vendor does sell their SaaS platform on the AWS Marketplace, but it’s another End User License Agreement (EULA) that needs to be accepted. And that means it has to be evaluated by a legal team: like most other EULAs, the lawyers will probably go “Yeah, it’s got a bunch of stuff in it that nobody could ever enforce, so proceed at a tiny risk”.

When you already have a cloud-provider, and the legal/finance agreements are in place, it’s just easier to use the provided service.

The ‘default’ product may well be inferior, have fewer features, and even be more expensive: but if I can click “use this” without involving legal – it’s the one I’ll likely choose.

My workload is too special for Serverless

A few years back it was “My workload would cost more in the cloud”, which, while I’m sure was true for some workloads, was a small and falling number. It fell even further when you actually costed in all the admin you were doing for your “cheap” servers.

Now it’s “my workload is cheaper on servers than serverless”. Again, this will be true for some workloads, but that percentage is falling every month as features increase.

Time for the Horror Story…

With every new technology, we need the horror story to dismiss it.

“bUt wHAT aBOUT tHe COld-StArT PeNalTy, thaT meANS tHiS IS uNusABlE fOr ME”

Serverless Function Refusenik

Yes, cold-starts are clunky, and if you’re on Amazon (at the time of writing), you cannot feasibly start a Lambda inside a VPC because the startup penalty is too painful. Fixing this is apparently on their roadmap for this year.

Microsoft are launching a pricing model that allows you to pay for some pre-warmed functions, which could give you easy scaling without the cold-start penalty, if the pricing is acceptable.

Anyway, for a lot of these cases, the API Gateway cache, or CDNs in front of your APIs, should be offloading a lot of traffic and ensuring that common items are rapidly available.
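
For what it’s worth, the stage cache is a one-call change. A rough sketch with boto3 (the API and stage IDs are placeholders), assuming the API Gateway cache cluster is what you want rather than a CDN:

```python
# Turn on the API Gateway stage cache so repeated GETs for common items
# are served from the cache rather than invoking a (possibly cold) function.
import boto3

apigw = boto3.client("apigateway")

apigw.update_stage(
    restApiId="abc123",   # placeholder REST API id
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/cacheClusterEnabled", "value": "true"},
        {"op": "replace", "path": "/cacheClusterSize", "value": "0.5"},  # GB
        {"op": "replace", "path": "/*/*/caching/enabled", "value": "true"},
    ],
)
```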

Stop swimming upstream

All the effort in IT infrastructure is heading towards serverless functions, container orchestration, and containers without actively running container hosts. The choice of hosted database or database-like storage services we’re offered can make it confusing to decide. But the answer is almost never “I’ll run something myself”.

Shunning these modern hosting options because you genuinely feel that your service is so special is, in nearly all cases, choosing to take the hard path for little reason. And someone else will use them, have the advantage of working far more on functional code and far less on overheads, and could offer a cheaper/better product than you.

Yes, I know that when you are at the scale of one of the top ten internet giants it can make sense – Dropbox moved their storage to their own appliances – but you’re not really Dropbox, are you?

AWS Launches MediaConnect and almost gives us multicast

It’s Re:invent time, and Amazon have launched a new service to make video routing to the cloud reliable and easier to set up.

A few weeks back I was at the brilliant DPP Leaders Summit, which was held under the Chatham House Rule.1 There were some great speakers, and I particularly loved the exec who said, to paraphrase, “If it doesn’t work without months of professional services, THEN IT ISN’T AN ACTUAL PRODUCT.”2

Anyway, one of the speakers was facing rebuilding their entire stack due to ownership changes, and wanted to do so in the cloud. They said “We need multicast and Precision Time Protocol”, which I can understand: for playout or production applications, the need for those two is pretty clear.

It’s now Re:invent season, which is the point in the year when AWS tend to release a lot of their good stuff. And yesterday they unveiled a new media ingest service, AWS Elemental MediaConnect.

It’s a managed service to get your video signals to/from/between your Amazon clouds.

This has historically been a pain: back when I was working on the Video Factory project we initially mooted a box in the cloud that we would send the signal to, which would then fan out to both archiving and live streaming. This was hard to do, so we side-stepped the issue and just rapidly uploaded the stream to S3 in consistently sized chunks instead. Later something was put in place to do the streaming, using something that I don’t think has been spoken about too much in public, so I shan’t detail it here.

Anyway, this new service allows you to send content to/from an endpoint using standard RTP (with or without Forward Error Correction) or the more reliable but commercial Zixi protocol. The video flow has an Amazon ARN identifier, which means that external accounts can be given permission to subscribe to the stream; the documentation says a ‘flow’ can have up to 20 outputs.
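
To give a feel for the shape of it, here’s a rough sketch with boto3 of creating a flow and entitling another account to subscribe to it (the names, CIDR and account ID are placeholders, and the exact parameters will depend on your workflow):

```python
# Create a MediaConnect flow fed by RTP-FEC, then entitle another AWS
# account to subscribe to it (each flow supports up to 20 outputs).
import boto3

mc = boto3.client("mediaconnect", region_name="eu-west-1")

flow = mc.create_flow(
    Name="ob-truck-1",
    Source={
        "Name": "ob-truck-1-source",
        "Protocol": "rtp-fec",               # or "zixi-push" for the Zixi option
        "IngestPort": 5000,
        "WhitelistCidr": "203.0.113.10/32",  # only the truck's encoder may push
    },
)
flow_arn = flow["Flow"]["FlowArn"]

mc.grant_flow_entitlements(
    FlowArn=flow_arn,
    Entitlements=[{
        "Name": "playout-account",
        "Subscribers": ["111122223333"],     # the subscribing account's ID
    }],
)
```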

How are we going to use this?

  1. Contribution to streaming output: fire the video somewhere and you don’t have to know if/where it’s being used
  2. Contribution for programming: using a few Amazon regions, broadcasters could very easily build a global contribution network to backhaul outside broadcasts
  3. Contribution from a playout appliance: if your cloud playout outputs to a MediaConnect flow, you can then send that flow into your broader distribution chain, allowing re-routing of things downstream.

It isn’t multicast within a VPC, and it’s not PTP; I suspect the latency involved may be too great to allow it to be used to route between different stages in a virtual playout chain.3

MediaConnect does however simplify integrating cloud processing workflows by providing fixed points at the edges in and out of the cloud.

I’ll be interested to see how people use it.

  1. That it is a singular rule is one of those bits of pedantry I cannot let go of
  2. This is probably a topic for another time, but the fact that so many enterprise vendors expect you to pay for their ‘product’ then explain that ‘oh, no, you can’t just use it out of the box even in a basic manner’ is a bit of a joke
  3. I could be very wrong here; I don’t have one of those hanging around to test with

Data collection at the job fair

Last weekend I went to a tech recruitment event, and I was a little shocked at how badly some employers did data collection.

When enquiring about potential employers, people have a vague expectation of privacy. This is lost when:

  1. Data collection is adding your details to a sign-up sheet, with the ability to see the details of everyone who did so before you
  2. Data collection is adding yourself as a contact on an iPad. This has all the problems of solution 1, but with the ability to send any contacts you like while you’re entering your data

Finally, don’t collect what you don’t need. Do you need to capture gender? And if you do, consider that for some people the options might not be as simple as “Male/Female”.

Recipe for success

What does a team need to deliver a successful software project? I’m starting to think about what I’ll want in my next engagement.

There’s plenty left to do, but as I approach the end of my current main assignment as a Technical Architect, I’m starting to think about what my future engagements should have.

This is my starter for ten (well, five):

  1. Anything but waterfall
  2. Genuine Public Cloud, with a hint of lock-in
  3. Internal users matter just as much
  4. Partnership with your Product Owner
  5. Embedded QA, seen as a benefit, not a drag

Anything but waterfall

Scrum? Kanban? Scrumban? I don’t really care exactly what it is, more that it works for the project, everyone understands and supports it.

I hate designing things entirely upfront; it just seems so conceited to think that you can genuinely design an entire system without trying to make any of it. While I know this doesn’t apply when you’re building a rocket1 or CERN, you’re not doing that, are you?

Yes, you absolutely need a sense of roughly where you’re heading, and ideally an end goal that you’re heading towards – but you also need the pragmatism to know if you try to build that from the start, you’re going to burn lots of rubber on the road, while making very little progress.

Show your dev teams that you can and do go back to make things better. Build the sense of trust that when you say “Just build the slightly-hacky ‘tactical’ thing, we will fix it later” that you do go back and fix it.

You’ll free everyone up from the performance anxiety of “Must get it right first time, because I can’t go back and fix it”.

Genuine Public Cloud, with a hint of lock-in

I would like to think that cloud is a given, but I still face people who say things like “It’s just someone else’s computer” – yes, but in general they have better capacity planning than you, or the “I could do x for cheaper” – which I’m sure you could, but you’re usually not factoring in the hidden costs.

The main system we built does have an on-premise element, but it’s controlled by the cloud, and deployed in a similar way.

We host the core of the system in the cloud, and that gives us an agility in scale and deployment we don’t have on-premise. Now, could we get that on-premise in time? I’m sure we could, but then we’d lose the benefits of the AWS value-add services…

“we use Amazon, but we only use EC2 and we don’t use any of their special services, so we’re not locked-in”

Speaking of which, when I hear that particular line, I want to congratulate the person on ensuring they’ve deployed their software in a way that will either cost them more, or be less reliable, or both.

At some level, to get the best value out of a cloud provider, you do need to be using their value-add services, meaning you can run bits of your application serverless and other bits as more scalable stateless systems.

Yes, if you write a Lambda, you can’t instantly port it to Google Cloud Functions, but given they both run Node, provided you put the thing that does the work in a scoped module, migrating should just mean writing the Google invoking code.
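
As a sketch of what I mean (in Python for brevity, but the shape is identical in Node; the function names are made up): keep the work in a plain function, and make each provider’s entry point a thin wrapper around it.

```python
# Core logic: pure Python, no AWS or GCP imports.
import json

def handle_order(payload: dict) -> dict:
    """The actual work; this is the bit you'd carry between providers."""
    return {"status": "ok", "items": len(payload.get("items", []))}

# AWS Lambda entry point: a thin wrapper around the core function.
def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    return {"statusCode": 200, "body": json.dumps(handle_order(body))}

# Google Cloud Functions (HTTP) entry point: another thin wrapper.
def gcf_entrypoint(request):
    # `request` is a Flask request object in GCF's Python runtime
    return json.dumps(handle_order(request.get_json(silent=True) or {}))
```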

I’m not saying use every service, but starting from the position that you’re only going to use Infrastructure as a Service is too dogmatic.

Internal users matter just as much

Yes it’s an internal system. Yes it’s not public facing.

Yes it should still be as performant and usable as your public properties.

Facebook probably does more than your system. Facebook is generally fast to use, and yet nobody gets training in how to use it. If your system requires lots of training, are you doing things as well as you could?

Consumer technology and services are good. Very good. Your users expect your system to match that, and when you give people tools that work well, they’re freed from hating the system they are using, and allowed to actually focus on the tasks they’re doing.

Focussing on my current engagement: a partnership with our core users meant they took on some extra manual work while we ran the extended migration. They only agreed to that once we had earned their trust, and realised that “could you do this for 3 months” meant just that (granted, it was more like 4 months).

Partnership with your Product Owner

Product Management is still a relatively new discipline, so there is no one-true-way, and I hope there doesn’t become one, because not all products are the same.

Regardless, partnership with your Product Owner is crucial, and if they’re technical you want to work hand-in-hand with them on key design decisions. If they’re less so, you need their trust and for them to delegate responsibility.

Embedded QA, seen as a benefit, not a drag

The embedded tester in the team is a key resource. They should ask questions, spot the things we didn’t, and invariably they’re the first call for “do we know what happens in situation x?”.

For all the frustration that Test Driven Development can cause when doing genuine micro-services, the testing framework it provides means that we never ship the same bug twice. Sometimes, when we’ve suspected bugs, modifying an existing test has helped us check our hypotheses quickly.

Easy regression testing makes you far more able to build and iterate quickly.
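
The pattern is simple enough to show in a few lines. A hypothetical example (the function and the bug are invented): when a bug is found, the fix lands alongside a test that pins the behaviour down, so it can never quietly ship again.

```python
# invoices.py (hypothetical): the fix caps the discount at 100%.
def apply_discount(total: float, percent: float) -> float:
    return round(total * max(0.0, 1 - min(percent, 100) / 100), 2)

# test_invoices.py: regression test for the (hypothetical) bug where a 110%
# voucher produced a negative invoice total. It ships with the fix.
def test_discount_never_goes_negative():
    assert apply_discount(total=10.0, percent=110) == 0.0
```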

In conclusion

You can’t make a project be a success, but there are things you can do that increase the chances…

 

  1. And talking of rockets, look at what SpaceX have done, which looks pretty much like rapid evolution of a rocket platform, adding more capabilities…

Re-use more than code?

“You can just re-use the code from x, can’t you..?” is a common call in organisations, but does it always make sense?

I’ve been working on a project recently, and when it started, we were “just going to use the components from <another project>”.

You’ve written many lines before, so why wouldn’t you re-use them? In the abstract it seems a pretty sensible thing, but it rarely works out that way in practice.

It’s unlikely your company is writing something as fundamental as a security library, where the domain is fixed, or as universal as the company Active Directory, where you only need one.

What you more likely have is a series of tactical solutions that meet the needs of each silo, which isn’t a bad thing, because they’re probably bits of code that were actually delivered. How often have we waited for the ‘generic’ solution that didn’t really work for anyone?

Now, I’m not saying you should avoid code re-use where something is genuinely re-usable. If the domain is simple and generic enough, converge on one library. But code isn’t the only thing you can re-use.

Going back to the specific example, I spoke to the architects from the project we were just going to lift-and-shift, and we discussed how the new things AWS has launched made much of it moot, or far more heavyweight than what you’d build if you were starting today. “You could re-use this, but why don’t you look at doing that” was the outcome.

Instead, the value came from talking about the things they couldn’t (feasibly) change now but would want to: “We have too much data in this account, and we can’t ever move it”. We used those as a basis, so we didn’t end up in the same situation.

Experiences and things learned along the way are just as valuable as avoiding writing some code.

Some great new/newish podcasts

If you’re searching for a new podcast after Serial, there are loads of them to choose from right now.

Podcasting, after many of the UK newspapers pulled out of it, is going through a resurgence. Here are some suggested additions to your listening list if you’re feeling a bit lost without Serial.

(I’ve still not listened to Serial, please don’t hate me).

NPR’s Invisibilia is from the same stable as RadioLab, but isn’t quite as heavily produced. Delving into the mind, the first few episodes have been really enjoyable.

Alex Blumberg’s (formerly of This American Life and Planet Money) meta-podcast Startup, about the launch of his podcasting empire (the one about the mistake is great listening for everyone who’s ever made one in business), has already stolen the hosts of internet show TL;DR to give us Reply All. Basically the same format: quirky stories about people and the internet.

Meanwhile, back at WNYC, TL;DR has a new host, and is still worth a listen.

Finally, Helen Zaltzman from Answer Me This now hosts a show about words, The Allusionist. It’s much shorter than AMT, and the first episode, describing her suffering at her family’s puns, will be all too real to anyone who listens to The Bugle.

You’ll be literally drowning in Mail Chimp mentions and Square Space promo codes. Did you know they’ve just launched Square Space 7, which integrates Getty Images… THEY’VE GOT TO ME.

Blogging about your Cloud Tech is only interesting when it’s Novel

If you’re blogging about moving to the cloud, you have to write about the interesting things in your migration, and not just how you did Best-Practice.

So a while back I bitched about Why The Cloud Is Oversold, talking more generally about the supposed other-worldly experience that having Sensibly Flexible Virtualised IT is… Well, I’ve a new pet-hate: organisations Overselling Their Adoption Of The Cloud.

I know transparency is good. It’s also pragmatic because if the information is on a computer that is even near another computer that’s on the internet, it’s going to be leaked.1

It’s genuinely interesting when people share the unique work they’ve done, especially when Public Bodies do stuff: look at how much gov.uk open-sourced, and how much of that govt.nz reused. We’ll not mention that Scottish Government developers can’t access the gov.uk repo as GitHub is a blocked “file-sharing” site.

The team I worked with at the BBC have spoken widely about how they turn ongoing streams of video into neatly segmented files that are uploaded to S3 at more than 1 gigabit a second, and how these are made into the things you see on /iplayer.2

Alongside the stuff that’s of sufficient scale to be interesting, Video Factory also uses a load of standard enterprise patterns: micro-services, communication through queues, separation of concerns, etc… They’ve spoken about these, but very much in a “we’re just doing best-practice after the big monolithic system pissed us off too much” way.

Anyway, I just read a blog post, by another public body documenting their transition to the Cloud and a new Responsive Website.3

Turns out sometimes they get a lot of load, and this is a problem they’ve had to solve. I’ll give you a second to think about how you’d solve bursty load on AWS.

Have you guessed?

They’ve only cached the site behind Varnish, and are running that in an auto-scaling group behind an Elastic Load Balancer.

That’s a pretty standard best-practice. Perhaps the novelty is that they’re a Public Sector body doing a sensible thing.4

But best-practice, by its very definition, just isn’t interesting blog-fodder: “Hey, We Do The Thing That Everyone Else Is Doing”.5

This leaves me wondering what next from this organisation:

  • “Our Windows PC Estate uses Microsoft Update Server to ensure they’re patched”
  • “We make our endpoints run anti-virus and disable USB ports on front-line single-use machines”
  • “We use Active Directory federation to provide single sign-on across all of our desktop applications”

If we’re really lucky maybe they’ll tell us: “How We Use Chaos-Monkey to Simulate Cloud Error-Situations”

I can’t wait.

  1. That is an exaggeration, but not nearly as much as I’d like it to be
  2.  I helped make this bit and I’m still disproportionately proud of it
  3. The kind you hate on the desktop because of all the white-space, and where the custom fonts don’t look quite right
  4. I could link to numerous projects here, so here is a small selection of failure
  5. Netflix get to do it, because they’re one of the groups setting out best-practice in AWS

Perfect is indeed the enemy of good

The desire to do things well stops us doing them at all.

I re-connected with someone on LinkedIn the other week. (Yes, I actually use it like that.) And he sent a lovely, long, detailed reply. One that I was delighted to read. One that I want to reply to.

But I haven’t.

Anytime someone sends me a nice, long, structured message, on pretty much any medium, it falls into the awful silo of “well, I need to sit down and write a nice reply”.

And it stays in that silo, along with all the other things like that.

So instead, I’ll write a little blog post about not being able to write, using up some of my daily word-quota in the process, and making the writing of the reply, even less likely.