Monoliths or Microservices: how about a middle way?

Should we deploy as micro-services or monoliths? How about neither?

The latest argument that we’re having, again, is how we should deploy our systems, and we’re asking: “micro-services” or “monolith”?

Now, I’ll try to skip past what we mean by all of those things (because it’s covered better elsewhere), but in essence, we’re asking “does our software live in 80 repos or 1 repo?”.

TL;DR How about we aim for 8?

What does good deployment/development look like

In an ideal world, we’d have the following properties in our deployment:

  • It would have appropriate tests and automation, so deployment is easy and doesn’t feel risky
  • The potential impact of a deployment should be predictable: something over there shouldn’t impact something over here
  • It should be clear where to look to change code

Problems with Micro-services

  • If you go properly granular, it can be difficult to know which repo code resides in – if you pick up a ticket, you shouldn’t need to spend 15 minutes identifying where that code lives
  • Deployment of related services may need to be coordinated more closely than you’d like, ensuring that downstream components are ready to accept any new messages/API calls when they arrive
  • Setting up deployment for each new component can be time-consuming (Although with things like CDK/Terraform etc, it should be possible to template much of this to a config file for the deployment system)

Problems with Monoliths

  • Code can potentially leak into production more easily – requiring more robust feature-flagging to hide non-live code. While this is good practice, it becomes a requirement in larger repos – you can avoid this by ‘dev’ deployments not being in trunk, but that’s a different kind of deployment complexity
  • Spinning up another instance of “the system” for testing of a single component may be more expensive and fiddlier than duplicating an individual component
  • The impact of a deployment may not be known: you may need to assess whether other commits included in what you’re putting live could break things, which may increase deployment friction
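As a rough illustration of the feature-flagging mentioned above, here is a minimal sketch. The FEATURE_ environment-variable convention, the flag name and the task_summary function are all made up for the example, they aren’t from any real system.

```python
import os
from typing import List, Mapping, Optional


def flag_enabled(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if FEATURE_<NAME> is explicitly switched on.

    The FEATURE_ prefix is an assumed convention for this sketch.
    """
    source = os.environ if env is None else env
    value = source.get("FEATURE_" + name.upper(), "")
    return value.strip().lower() in {"1", "true", "on"}


def task_summary(tasks: List[str], env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical function with a non-live code path hidden behind a flag."""
    if flag_enabled("new_summary", env):
        # Merged to trunk, but invisible until the flag is turned on.
        return f"{len(tasks)} open tasks: " + ", ".join(sorted(tasks))
    return f"{len(tasks)} open tasks"
```

The flag defaults to off, so unfinished code that has reached trunk stays hidden unless someone deliberately enables it.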

How about Service Cluster Deployments

In a prior engagement, we built what was really a task management system.

  1. Messages would arrive which could potentially make new tasks for the system, or update existing tasks: these were handled by the task-creator-and-updater
  2. The task-viewer would access the database of tasks, cross reference with other services, and create a unified view of the task list
  3. An automation component would use the output of task-viewer to initiate actions to resolve the tasks, which would ultimately result in more messages arriving, which then updated the task database

In our deployment, these components were in 3 different projects: the micro-service model. It worked, but it is also an example of where 3 components could be combined into one functional service repo.

This makes sense to me because the 3 services are closely coupled, especially the task-creator-and-updater and the task-viewer. So maybe they could have been in a combined repo, task-management.

With this setup I could still feel safe doing a deployment on a Friday afternoon to one component, because even if the task management system failed entirely, the manual processes were in place to allow recovery until the system could be rolled back.

Meanwhile, for another one of our components, the cost of a failed deployment was so high, and the work so time-critical even if it was recovered, that we only deployed during ‘off-peak’ periods of the week. Could it have been made more robust? Probably, but it was also a relatively static system – that effort was better spent on other components that were more ‘active’.

In summary

Your deployment should work for your team. It should be based on templated conventions that allow easy configuration of new deployments, and it should be as granular as makes sense.

Instead of worrying about being “truly micro-services” or “fire & forget monolith”, find the smallest number of functional groups to keep your code in. That way you can have scope-limited deployments, without having hundreds of repos.

Finally, please, just-name-your-repos-like-this, it’s funny at first giving things amusing names, but honestly, kitchen-cooking-oven is far more supportable than the-name-of-a-dragon because it gets really hot.

Things that it was good we learned this year

Rather than a “hey look at the good that actually happened in 2020” I’d rather take a more realistic view: there are a whole bunch of things that are very good for us to have learned/been reminded of – but that aren’t necessarily good

Medical Science can do amazing things, when we want it to

  • The virus was genetically sequenced within a month of it being discovered
  • The speed at which we got a whole range of vaccines/vaccine approaches into trials was incredible – that many of the vaccines were made never having been near the virus, is incredible
  • RNA vaccines could revolutionise things – we’ve shown efficacy and some have suggested that if we have vaccine templates in place for major families of pathogens we could have the ability to rapidly respond to the next pandemic
  • Clinical trials, such as the UK “Recovery” trial found existing drugs that have improved survival rates

Turns out we need society and collective action

  • Masks help other people more than they help you… without that sense of shared responsibility their use won’t be consistent
  • Local Mutual-Aid groups popped up around communities, and people helped those shielding or isolating

The situation has codified and made visible inequalities

  • People with jobs where they can comfortably work from home, have been shielded from a lot of the worst of the situation
  • They can likely afford to buy delivery food and order online – effectively outsourcing their risk to someone else
  • Freed from that casualisation of employment, they have far less concern if asked to isolate
  • Contactless payment everywhere is great for those of us who hate carrying cash, but less good for the under-banked who find it hard or impossible to get a contactless card or account

The casualisation of employment hurts us all

  • Casual workers tend to move around between clients and sites more, which in the case of Melbourne’s Aged Care facilities, meant that they were in a prime position to move the infection between facilities
  • Casual workers also can’t as easily sit at home waiting for a test result, or isolate when they’ve had a contact, because they have bills to pay.

Not all Key-Workers wear a fancy uniform

  • Health service employees have had a terrible time, working in stressed conditions, seeing more death than any of them signed up for. But they did sign up to work for a critical function in society.
  • The same can’t be said for people who work on the tills in the supermarket. We discovered that food-retail is pretty important – but until now those people didn’t get any of the cachet for that role
  • Similarly delivery drivers, postmen, transport workers… the list goes on, but there are numerous roles, that are occasionally looked down on, that turn out to be vital

Food supply chains maybe don’t need to be quite so lean

I love a just-in-time manufacturing supply chain as much as any logistics nerd, but while it makes sense for car-parts, it maybe doesn’t make sense for everything

  • Food systems adapted, but initially they were shown wanting for supplies (interestingly packaging rather than the food goods in some cases)

5 days a week in the office is dead, remote learning can be a thing

  • Turns out that most jobs can almost be done at home – knowing this might help people less able to travel due to health conditions get work
  • 5 days a week at home isn’t great though: it can be lonely, cramped, maybe not possible if you’re in a shared home. I suspect we’ll end up with most people doing 2-3 days a week in an office and the rest remote
  • Local co-working spaces may well pop up – many people who want to get out of the house may not need a commute, they just need a comfortable place to work where they can see people, just not their colleagues
  • Many disabled people have tweeted, in justified disgust, that they asked for remote learning for years, and were told “it wasn’t possible”. Turns out that lecturers can load Teams when able-bodied people need that facility

The Economy needs to adjust to the end of Big City Centre Offices

  • Our economy is too service sector heavy: Jobs have been lost because people aren’t coming into city-centres anymore. I sympathise with those employees & unlike a govt campaign (actually from months before the pandemic) not all of those people can retrain in cyber
  • Despite this hurt, and what politicians/commercial property owners say in press op-eds: we are not obliged to support the service-sector in the city centre as a patriotic duty
  • Some of those jobs may be relocated to places nearer where people are now, (personally I’ve still bought far too many takeaway meals while WFH)
  • Going forward, do we need more mixed-use neighbourhoods like you see in mainland Europe, driving the demand for services, not driven by office-workers

The NHS

To those people working in the NHS, I think we all thank you, and hopefully we’ll find ways to do it that go beyond clapping

  • While no health service truly coped with this, years of recruitment shortfall have exacerbated difficulties in the NHS response
  • While it was impressive how quickly we built the Nightingale hospitals, admitting hospitals needed to provide the staffing and equipment for their patients, so it made little sense to use them, and they became white elephants
  • When freed from arbitrary government targets, the NHS can radically reconfigure itself when it needs to.

Personal Protective Equipment (PPE)

  • The entire world shouldn’t outsource production of PPE to one area in China, that happened to be hard hit by the outbreak – we should have some on-shore production that could be ramped up at times of crisis
  • PPE Management shouldn’t be given to random companies – this is a UK specific one, but we love outsourcing things, and sometimes you can’t just treat specialist things as commodities
  • Those PPE stockpiles should be actively managed – rather than building up emergency stockpiles – we should always be taking stuff out of them and replenishing, that way we don’t end up with years old PPE that needs to be re-certified for use, which doesn’t instil confidence

UK Government Response

  • The Prime Minister undermining the response before it even began, when the government message was “Don’t go home for Mother’s Day” and the PM was quoted “I’m still hoping to see my Mum” wasn’t a good look. Nor was sanctioning his advisor for taking a unique eye-test in Durham
  • Any effective pandemic response always feels excessive, because it’s done before it feels needed. Having a PM who is unable to make decisions before his hand is forced, doesn’t work so well
  • The UK needs to give up its obsession with throwing problems at generic business process outsourcing companies, they do a shit job time and time again, and have taken far too much money for a sub-standard response
  • Continuing the above, local contact tracing teams work better than centralised teams – replicated in other places, not just the UK
  • After a brief period of IT improvement through GDS, the UK can once again have poor IT processes waste money with private IT companies: just look at the money spent on the initial Bluetooth app that everyone who understood iOS restrictions told them wouldn’t work
  • Worse was the failed Excel handover mechanism, which cost lives because contacts weren’t followed up. I know that people do things at haste in situations like that… but still

We need to address online-misinformation

  • There have always been contrarian conspiracy people, but without the distribution channels of social media their impact was limited. Now their actions undermine trust in the response, vaccines, etc.

Can we stop with the (inappropriate) gatekeeping?

It’s another week, so it’s time for everyone’s favourite game: Gatekeeping.

In particular, this example: Chloe (a Senior Developer Advocate for Microsoft who does some cool stuff with code, while putting up with being a woman in tech on twitter) posted this:

Now there are a whole variety of reasons for this being a good thing, there’s evidence that diverse teams, while sometimes being worse at doing repetitive/samey tasks than less diverse teams, when thrown new problems do better.

Also, having people who aren’t white comp-sci males on a team leads to picking up on things, like an awareness of how your product might be mis-used. Abusers have used Venmo to send money to their victims, because “why would you want to stop someone sending you money”.

Of course, a man was here to quibble advise:

Now, machine-learning is an interesting discipline to pop up and claim that inexperienced people aren’t going to do a good job… we’ll go into that in a second.

Yes, it’s probably true that someone starting out will not be able to generate an entirely new model. But will they be able to follow tutorials and train one of the existing models? Likely yes.

Will they be able to replicate the many mistakes that ‘pro-fess-ion-al’ machine learning engineers have? Absolutely.

Machine learning has been used to codify our biases. Facial recognition performs worse on non-white faces… “flight risk assessment algorithms” which are commercially sensitive so can’t be audited, seem to report that certain communities are more of a risk.

Meanwhile there was that time that a “cancer detection” model, had actually been trained itself to detect the different colour of slide-frames that were used between control and malignant samples.

I’m just saying, that maybe Machine Learning isn’t yet the rigorous pillar of integrity and correctness that needs protection to preserve its pureness.

“React is for n00bs”

This is another good one.

When new devs start out and they use react, a variety of callouts appear:

  • “It’s too complicated, they need to learn the basics”
  • “React is too heavy, they need to learn to optimise”
  • “the amount of javascript we use on the web is too high and a security risk”
  • “if you don’t learn the basics of DOM manipulation how can you possibly do it well”
  • Server-side rendering of client-side apps is just a return to the old way
  • We shouldn’t be building apps on the web

Most of these are true to a greater or lesser extent, but you know what else is true?

This is what the web looks like now…

It is not where any of us would probably start, but it’s where we are.

Having architected a business system that uses React as the UI, that system would have been painfully unusable if every interaction was a page load or form reload… modal popups and API calls made it a better experience for users.

“They’re building unoptimised systems and that’s not good”

That is also true, however how do you learn to build an optimised system?

You ship something that gets to the point where it needs to be optimised. Many systems don’t need to be… Good enough, is, well, good enough.

These things are analogous to scaling problems: if you get them, they’re nice to have.

We do want some gatekeeping

I don’t want a newbie coder to write the control software for a nuclear reactor… This is unlikely

But more realistically, the area where we need to find ways to help new programmers is the basics of security.

I don’t want a newbie writing a user registration system, there are plenty of managed Identity Providers (IDP) out there like Auth0, Cognito, AzureAD, Login with Google, Login with Apple etc…

So yes, I wouldn’t want a newbie writing an IDP of any complexity, I can see them storing passwords in cleartext in a mysql database.

But we don’t talk about these things, or how we can give new programmers an intro to the “easy” 80% of security things: basic security on APIs, not storing secrets in your app, not using sequential/predictable IDs around the place.
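To make a couple of those “easy” items concrete, here is a minimal sketch using only the Python standard library. It is an illustration and not a replacement for a managed IDP: the function names and the iteration count are my assumptions, chosen for the example.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

# Assumed work factor for this sketch; tune it for your own hardware.
ITERATIONS = 600_000


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) so the cleartext password is never stored."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)


def new_public_id() -> str:
    """An unguessable identifier, instead of a sequential 1, 2, 3..."""
    return secrets.token_urlsafe(16)
```

None of this is hard, which is rather the point: it is the kind of thing a newcomer can pick up in an afternoon if somebody shows them.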

It’s much more foundational “go and learn enough before we deem you WORTHY of writing for the web”.

Some people learn by doing a CompSci degree. I have one of those.

While it taught me a bunch of formal things, so much of what I’ve learned is by working with good people, making mistakes, and learning more.

I learned React in part because I was working with a bunch of coders who were learning it…As an old school HTML, JS, JQuery & CSS person, I was initially confused and scared of it. Then create-react-app appeared and I finally got it.

If we don’t turn down this obsession of gatekeeping entry, we don’t let new people learn.

We end-up with the same faces, and products will be worse for everyone. Us older-school people will get stale, stagnate and just write the same stuff until we get retired.

We can nudge better than with streaks…

The brittleness of breaking a streak needs to be broken somehow.

So yesterday I had my phone swiped by someone who was on a bicycle, this was thankfully one of the few times I’ve experienced crime in my life, and was non-violent, and the phone has been remote wiped.

This is annoying, leaves me a little shaken, and also annoyed as the thing will likely be dismantled for parts. It’s locked and while I know that is imperfect, it’s work to remove. But, I have insurance, so it’s all good mostly ok.

But, frustratingly because my phone was gone, I just lost my 205 day Activity streak from my watch.

This was slightly annoying because one bit of Apple did know I’d moved, the “with friends” feature, but I suspect that is just sharing 6 numbers every few hours (the target and amount done for the 3 categories). This isn’t the “health” database, which is a far more granular time-series database, so the streak is broken.

Adding insult to injury theft, the watch even congratulated me for hitting my goal and extending my streak; unaware the achievement would disappear into the electronic void.

And this is the problem with these things, my mind (at least) goes to a place that “getting back to 205 days will be hard, it’s winter”. This now becomes a bit “why bother” rather than incentive.

The same thing applies to Duo-Lingo, Headspace, etc. When it’s a daily thing, and you fall off the wagon, it feels difficult to get back on. I know it might not be as effective, but what if I want to do 3 days a week of each – my 10 minute self-improvement slot.

David Smith’s MyTrends++ has the concept of rest days, if you get 7 days on, you’re allowed 1 day off. I think that works nicely.

However, these things also don’t account for if you got ill for a few days and couldn’t exercise. I hate to cite the UK privatised rail system, but they are (or at least were) allowed to declare void days, when evaluations didn’t count – I guess charitably you could say these were fair when absolutely awful weather hit.

If you’re ill and not feeling able to exercise for a few days, that’s (probably) not your fault. But I think, paradoxically, the bigger the streak the bigger the “I’ll never get there again” feelings that arise.

Amazon’s new Halo wearable judges you over a week (if I remember correctly from a podcast), rather than daily, which I think is more sensible, and more understanding of your life.

I don’t know how to fix this as I’m not a behavioural psychologist, but I wonder if you should be earning the equivalent of “long service leave” – maybe you earn one cheat-day a month, up to a maximum of 5?
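To show what I mean, here is a rough sketch of that “long service leave” idea. The rules are my guesses rather than anything a real app does: a rest day is earned for every 30 goal-met days, at most 5 can be banked, and a missed day spends a banked rest day instead of breaking the streak.

```python
from typing import Iterable, Tuple


def streak_with_rest_days(
    days: Iterable[bool], earn_every: int = 30, max_banked: int = 5
) -> Tuple[int, int]:
    """Return (current streak, rest days still banked).

    `days` is oldest-first; True means the daily goal was met.
    """
    streak = 0
    banked = 0
    active_since_earn = 0
    for met_goal in days:
        if met_goal:
            streak += 1
            active_since_earn += 1
            if active_since_earn == earn_every:
                banked = min(banked + 1, max_banked)
                active_since_earn = 0
        elif banked > 0:
            # Spend a rest day: the streak survives but does not grow.
            banked -= 1
        else:
            # Nothing banked, so the streak really is broken.
            streak = 0
            active_since_earn = 0
    return streak, banked
```

With these rules, a single lost day after a month of activity costs a banked rest day rather than the whole streak.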

I don’t know the solution to this, but I think the current obsession with daily gamification isn’t really that great, and I know we can do better.

A feature request for LinkedIn

I like LinkedIn, but I would love if I could make recruitment messages more relevant.

I’m about to whine about recruitment, which I understand isn’t great when many good people are looking for work.

If you can do anything to help people in your network, recommendations, connecting people up – now is the time to lower your reputation-risk considerations (what if they aren’t a match, aren’t good) and do it anyway.

Although I dislike the Storification of LinkedIn, and find “Heart Warming Stories of Dubious Origin, About That One Time Someone Showed Basic Human Empathy” posts a little grating, I like LinkedIn.

I primarily work in the Media & Entertainment industry, and very often people move around. One time I was working with a team who were re-engineering a high profile transcode stack, and we needed to check compatibility with one consumer whose very Fussy Set-Top Boxes required specific H264 encoding parameters.

Searching on LinkedIn found that someone I’d previously worked with was now there, and that was one of those useful back-channels that actually get the work done, alongside the formal ones where invariably detail is lost in all the mediation layers.

I’ve previously found work through LinkedIn also, people in my network were looking and we had chats…

In both of these cases it was a route to contact people who I likely wouldn’t have managed otherwise.

The Bit Where I Bitch About Recruiters

While I know #NotAllRecruiters, many are somewhat annoying.

I’m quite specific in my profile intro of the kind of roles I’m open to, and still I get requests to be a: Permanent, SAP, Project Manager, in Bracknell.

That’s one technology I’ve never worked with (merely around) and 3 job qualities that I will avoid.

Tiresome for everyone, a waste of my time to read and theirs to send.

The over-engineered solution

As mentioned, I’ve a number of relatively simple conditions about jobs I’ll consider.

One time I got a message about a job that was “Only for Oxbridge graduates, but Imperial is also OK” – I know this was meant to be flattering and give the impression of an intellectual workplace (while also being a bit negging that “Imperial was almost good enough”). However, it just screamed of a horrendously toxic culture with Platinum Grade Gatekeeping.

So if you’re specific about what you’re looking for why don’t you get to state that in some questions, and when a recruiter who isn’t in your network wants to contact you, how about they’re given a page like this… (please excuse the 💩 mock)

A list of questions a recruiter might face: is the position permanent or contract, using appropriate technologies, what the salary is

Actually Maybe This Is an Application for ML…

As I was writing this (helpfully after doing the 💩 mockup), I thought of a much better solution: If you can choose from a smaller range of criteria – and ones that could be detected by an ML classifier – LinkedIn could just run the classifiers you care about on an “out of network” message.

The score of the message could then drive a traffic light system: the message is accepted, outright denied, and if borderline the sender needs to click a “Yes, it’s appropriate and your classifier is wrong, scout’s honour, promise” button.
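A minimal sketch of that traffic light, assuming the classifiers already exist and each returns a probability that the message meets one of my criteria. The thresholds, the criterion names and the “weakest score decides” rule are all made up for illustration.

```python
from enum import Enum
from typing import Mapping


class Light(Enum):
    GREEN = "accept"
    AMBER = "sender must confirm"
    RED = "reject"


def triage(
    scores: Mapping[str, float], accept_at: float = 0.8, reject_below: float = 0.4
) -> Light:
    """Map per-criterion classifier scores (0.0 to 1.0) to a traffic light.

    The weakest criterion decides: one clear mismatch sinks the message.
    """
    worst = min(scores.values()) if scores else 1.0
    if worst >= accept_at:
        return Light.GREEN
    if worst < reject_below:
        return Light.RED
    return Light.AMBER
```

So a message scoring well on “is it a contract role” but badly on “is it a technology I work with” would be rejected, while a borderline one would get the scout’s honour button.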

Would it work?

Unless there was a penalty for clicking “This Isn’t Spam” I doubt it would.

I also suspect it would hurt LinkedIn’s revenue too much, if having paid for Gold Premium Ultra, people aren’t able to send messages

To the good recruiters, who like great project managers are rare but invaluable – I’m sorry.

To the rest of you, I’m just not ready to do SAP in Bracknell.

Prescriptive Software Practices: Code Re-use Edition

Individual software practices don’t exist in a vacuum, and need to be viewed collectively.

Today I saw this tweet, that I initially violently agreed with, before realising the answer is really more “it depends”.

Now I fully agree that demanding that people write the abstraction layer before they’ve even written the first component to use the underlying tool, is a folly that leads to bad libraries. You don’t know how to best use the underlying API, and you don’t know how you want to use it, and which of the methods you want to wrap or enhance.

The requirement to wrap every ‘method’ is the main reason I dislike intermediate libraries, one time I asked “are we using this new AWS feature that’s perfect for our use case?” The answer: “No, we can’t because Terraform doesn’t support it yet.”

Any time you put something in-between you and the underlying service you’re introducing a potential roadblock. I’ll explain later how I think you can minimise this.

The main reason I think code-reuse/libraries are hard to get right is a conflict at the core of them:

  • A trivial library can be simple to use, but if the functionality is simple, what is it really adding?
  • A feature-filled library is usually (but not always) harder to make use of, and if most people only use a fraction of it, what makes it worth the overhead?

Things don’t exist in isolation…

Warning, inbound analogy: Very often “we” like to look to other countries and cite how wonderfully they do a thing. An example from the UK is that we’re often told that “people shouldn’t mind renting flats because in Germany people tend to buy later.”

Which sounds great, but when you point out that Germany has a bunch of related things – longer leases, more freedom to decorate/change properties, and that they consistently build houses to maintain far more modest house-price rises – people tend to go quiet.

Returning to software, everything is similarly related and supported by other practices. If you don’t fully understand a problem, you can’t cleanly decompose it into a sensible collection of services, and only when you’ve done that will sensible opportunities for code re-use/libraries emerge. (At this point you’re welcome to argue that if you’ve decomposed your system properly then you shouldn’t need to reimplement functions).

XP/Agile/Clean Code/BDD/TDD/… can become quasi-religious in how much you must adhere to all of their tenets. I suspect very few people are fully compliant with any one tribe, and to be effective as teams you need to view things as recommendations or possibilities, and not commandments that thou shalt obey.

How to do code re-use right…

This is just my experience, but a few questions to ask or points that I’ve found have worked for the people I’ve worked with in the past:

  • Avoid needing them in the first place: if your transaction volume is low enough just have a dedicated service that does the particular thing… A single point of truth is the easiest way, but that isn’t always possible due to latency or cost concerns
  • Consider Security/Auth/Data-protection first: These are things that you need to create decent libraries/patterns for, because if the easiest thing is the right thing, you’re going to be making fewer critical mistakes, and it can make patching easier if you’re exposing a consistent interface but have to update an underlying library with breaking changes
  • Judge the demand: While many times people can be “wow, I didn’t realise I needed x until it appeared” unless it’s really obvious that lots of people have the exact problem, do you really need to write a library?
  • Understand it before you abstract it: Don’t write them first. My ideal preference is that when you have a few teams working in the domain, let them create distinct implementations. Afterwards, regroup and use that learning as the basis for a library. This is more work, but the result will be much better
  • Keep the library fresh: Is it one person’s job? Is it a collective whole-team effort? A library needs to be a living thing as the systems it interacts with will change. Developers will rightly shy away from using a clunky piece of abandoned code
  • Layer up in blocks: a client has a back-end system with specific authentication requirements and has been building out client libraries. There are 3 distinct libraries: connection management, request building and result parsing. You don’t have to use all of these, and can just use the connection library if you want more direct access
  • Make your library defensive but permissive: TypeScript has got me back into typing, but previous experience makes me nervous. In micro-services environments a library update can require many unrelated deployments, when only two components are functionally involved. Errors because enums aren’t valid can be useful, but can you expose the error when that property is accessed rather than when it is parsed?
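The last point, raising on access rather than on parse, can be sketched roughly like this. The TaskMessage and Status names are hypothetical, and this is one possible shape rather than how any particular library does it.

```python
from enum import Enum
from typing import Any, Mapping


class Status(Enum):
    OPEN = "open"
    CLOSED = "closed"


class TaskMessage:
    """Wraps a payload without validating every field up front."""

    def __init__(self, payload: Mapping[str, Any]):
        # Permissive: unknown enum values do not stop the message being parsed.
        self._payload = dict(payload)

    @property
    def task_id(self) -> str:
        return self._payload["id"]

    @property
    def status(self) -> Status:
        # Defensive: the error only surfaces for code that actually reads status.
        raw = self._payload.get("status")
        try:
            return Status(raw)
        except ValueError:
            raise ValueError(f"unknown status {raw!r}") from None
```

A component that only routes on task_id keeps working when a new status value appears upstream, so only the components that read status need the library update and a redeploy.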

In summary…

Teams need to find their own path, and find where on the line between “Don’t Repeat Yourself” and “Just Copy-Paste/Do Your Own Thing” they lie. It is highly unlikely to be at either extreme.

“It Depends” isn’t a particularly enlightening answer, but like so many things about building decent products, it is what it is.

On THAT Excel Issue

What can we actually learn from the government’s Excel related issues?

There have been many comments posted in the last week about “excelgate” or whatever we want to call a life-threatening data exchange problem. This post is not about absolving the government of blame for this, or the countless failings they’ve made across the Test & Trace programme. Between the app that everyone who understood iOS Bluetooth told them wouldn’t work, giving the bulk of Contact Tracing to private companies not to local health teams… I’m really not excusing them.

But, I do think there are more nuanced lessons that can be learned beyond “LOL WOT THESE N00B5 USING M$ EXCEL. Y U NO PYTHON?” which is an exaggeration, but not by much, of some of what I’ve seen online.

I’m writing this based on the following assumptions/guesses: Data had to get from one system, to another – and .xls not .xlsx was used, this hit a row limit. (This really should have been an automated feed, but that’s not what I want to explore here, I want to explore how organisations can prevent people doing ‘good’ things)

So, we’re using an inappropriate data transfer format, with a hard limit of how many rows it can contain… This sets up a few different scenarios:

  • Nobody foresaw this problem
  • The problem was known, but the decision was taken not to fix it
  • It was known, people wanted to fix it, but couldn’t

If we explore these, I think there’s some learning we can take away for organisations we work for or with, about how some of our anti-patterns might lead to scenarios that put us into them.

Nobody Foresaw This

This would be the most damning of the outcomes. It was a risk that nobody had realised that they were living with, and crucially that the software doing the export didn’t warn you about.

Tips to avoid it:

  • To borrow from the WHO: Testers. Testers. Testers. Hire decent testers, the ones who infuriate you with “What if this series of 3 highly improbable events happens?”
  • As we’ll come onto in a second, listen to them when they say these things.

It was known about, but decisions were taken not to fix.

These aren’t fun. As someone who once predicted a particularly nasty auto-scaling bug and tried to warn people, only for it not to be accepted as needing a fix until it occurred, I know it can leave you feeling “if only you’d argued the case better”.

But it’s legacy…

Matt Hancock, the UK Health & Social Care Secretary, described the system as (paraphrased) “Legacy and being replaced”.

We’ve been here, a system that is old, being replaced, is considered frozen because “it’s going away”. However, I know of systems that were due for replacement in the next 6 months, but 3+ years later development hasn’t started. This was used as a reason not to do relatively trivial UX changes, that could have been a great improvement to the operators.

Tips to avoid:

  • Until you unplug the server, turn off the instance or stop new data flowing into it, no system is “legacy”

“It’s very unlikely… we can live with it”

Nobody, apart from epidemiologists and software billionaires, predicted a future epidemic on this scale – so I guess that maybe the problem was known, and the decision was taken to live with it. Going back to the first recommendation about hiring a tester: sometimes so many scenarios are found that it’s easy to tune out because, like Cassandra, the tester is always talking about problems.

Tips to avoid:

  • It’s ok not to fix everything, but if you’re living with a risk, make sure it’s known, and doesn’t fail silently.
  • Keep it in your risk log, and actually re-read that log once a quarter to assess whether the risks in it are now more of a problem.
  • Try to be a little less agile, at least in methodological purity, and go beyond “what we’re building next” and look a few steps ahead.
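To make “doesn’t fail silently” concrete, here’s a minimal sketch of an export guard. The function name, the exception and the idea of a fixed row limit passed in by the caller are all my own assumptions for illustration, not how any real system worked. The point is only that hitting a known limit should stop loudly rather than quietly drop data.

```python
class ExportLimitExceeded(Exception):
    """Raised when an export would not fit in the target format."""


def export_rows(rows, max_rows, write_row):
    """Write every row via write_row, or refuse to write anything at all.

    max_rows is whatever limit the target format imposes (hypothetical here).
    """
    rows = list(rows)
    if len(rows) > max_rows:
        # Fail before writing, so nobody is left with a truncated file
        # that looks complete.
        raise ExportLimitExceeded(
            f"{len(rows)} rows exceeds the limit of {max_rows}; nothing was written"
        )
    for row in rows:
        write_row(row)
    return len(rows)
```

If you decide to live with the limit, this is the cheap version of living with it: the risk stays in the log, but the day it bites, someone gets an error instead of a short file.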

We wanted to fix it…

This is when we get into some of the most depressing collection of scenarios:

“You can’t just make a change, this needs a PROJECT”

Changes need to be properly developed, tested and deployed, but sometimes this doesn’t need a full project structure created. When all improvements are painful to implement, people just accept and build workarounds, some of which you may not be aware are in place.

Tips to avoid:

  • Have a lightweight process for “short-order” requests that are small.
  • Find ways to bundle these into bigger releases alongside the “im-por-tant” work.

“It’s too expensive”

If you have a bad contract with your supplier, it could just cost too much to viably fix.

Tips to avoid:

  • Only buy software/services where the API is included, and is nice to develop against (I’m looking at you SOAP)
  • Have clear boundaries in your systems/components, own the integrations yourself, so you can swap components or combine as required

“The person who develops it is too busy/gone away”

You could imagine that, even if this system was modifiable, the people with IT skills are right now working elsewhere on the plethora of other systems that have had to be spun up to cope with the current situation.

Worse, though, is when software has gone stale: you may have developers who could work on the problem, but nobody really understands how to build/deploy it anymore, so it’s effectively stuck.

I’ve worked with clients who had problems with code going stale, and instituted a very strict “if you modify a component you must fully adapt it to be in line with our current standards” rule to fix this. However, this just introduced a disincentive to make minor changes to improve things, because the developers knew that alongside 5 lines of functional code changes, they had to make 500 lines of dependency-related changes.

Tips to avoid this:

  • Avoid one product/system/component being solely one person’s ‘thing’.
  • Find ways to allow people to deploy minor changes as a BAU process, gradually updating components into modern ways of working without dogmatically requiring every component to be fully updated.

In conclusion

We’ve all used Excel files or CSVs in email, or a Google Sheet, as an interim solution. The problem is that these interims become permanent and eventually they stop working. I’m lucky in that mine were about keeping TV or VOD on-air, and not about life or death statistical reporting processes.

But still, let’s tone down the sneering “BUT WHY WASN’T IT AUTOMATED” talk. Yes, it clearly should have been, but none of us know the decisions being made, or the available software hooks that the operators/developers had access to.

Always monitor your systems, spot where things can be better and make the incremental improvements because they add up over time. Never invest all your hope in the new system/rewrite because they’re always years away, and usually come with their own new ‘quirks’.

The One Boring Reason Why People Use the AWS Service

One of my clients recently started using a relatively new AWS CI/CD Service, and I just stumbled on a defensive/marketing type post from one of the traditional providers. And it made me realise how much vendors can miss the reason people choose to go with the AWS/GCP/Azure service, even if it’s inferior.

Aside: I’m not going to link to the article because they don’t deserve the clicks.

Back to their post, it went through a familiar structure:

  1. “But it doesn’t have all the features, our lovely features”
  2. “You can’t self-host, you’re LOCKED-IN!”
  3. “Why not buy into our broader platform?”

I’ll go through these in turn, before getting to the actual reasons.

“It doesn’t have the features…”

It doesn’t. It’s version 1 of an AWS product… they always launch very lean and gain new things.

And yes, it only supports 3 integrations while Vendor supports around 30. Turns out though those 3 are the most important ones. Others will be added I’m sure, but only where people will use them.

“You can’t self-host, you’re LOCKED-IN”

Good. I literally don’t want to.

I know that some Ops-Teams feel happier that they can touch a container or an instance, but this is a product that can be replaced quite easily, including by this Vendor should the need arise.

They do have a SaaS offering you can pay for, but it’s relatively expensive for small-teams. (And we’ll come onto legal things later)

“Why not buy into our broader platform?”

Lock-in to your cloud provider is bad, but if you use all of their products you can get a great unified experience… which sounds a little like, erm, lock-in.

The simple reason people choose the service on their Cloud… procurement

Companies generally make buying stuff difficult. Every new vendor is a new round of legal review, potentially procurement exercises. It’s a painful affair.

This Vendor does sell their SaaS platform on the AWS marketplace, but it’s another End User License Agreement (EULA) that needs to be accepted. And that means it has to be evaluated by a legal team: like most other EULAs the lawyers will probably go “Yeah, it’s got a bunch of stuff in it that nobody could ever enforce, so proceed at a tiny risk”.

When you already have a cloud-provider, and the legal/finance agreements are in place, it’s just easier to use the provided service.

The ‘default’ product may well be inferior, have fewer features, and even be more expensive: but if I can click “use this” without involving legal – it’s the one I’ll likely choose.

My workload is too special for Serverless

A few years back it was “My workload would cost more in the cloud”. While I’m sure that was true for some workloads, it was a small and falling number. It fell even more when you actually costed in all the admin you were doing for your “cheap” servers.

Now it’s “my workload is cheaper on servers than serverless”. Now, again, this will be true for some workloads, but again, this percentage is falling every month as features increase.

Time for the Horror Story…

With every new technology, we need the horror story to dismiss it.

“bUt wHAT aBOUT tHe COld-StArT PeNalTy, thaT meANS tHiS IS uNusABlE fOr ME”

Serverless Function Refusenik

Yes, cold-starts are clunky, and if you’re on Amazon (at time of writing this), you cannot feasibly start a Lambda in a VPC because the startup penalty is too painful. This is apparently on their roadmap for this year.

Microsoft are launching a pricing model that allows you to pay for some pre-warmed functions, which could give you the best combination of easy scaling and quick starts, if the pricing is acceptable.

Anyway, for a lot of these things, the API-Gateway memory cache, or CDNs in front of your APIs, should be offloading a lot of traffic and ensuring that common items are rapidly available.
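As a rough illustration of letting a CDN take the traffic, here’s a minimal sketch of a Python function handler that returns a cacheable response in the shape used by API Gateway’s Lambda proxy integration. The payload, the 60 second lifetime and the `id` path parameter are assumptions made up for the example. As far as I know a CDN in front will generally respect the `Cache-Control` header, whereas API Gateway’s own cache takes its TTL from the stage settings rather than from this header.

```python
import json


def handler(event, context):
    # pathParameters can be missing or null when the route has none.
    item_id = (event.get("pathParameters") or {}).get("id")

    # Hypothetical payload; a real function would look the item up somewhere.
    body = {"id": item_id, "title": "example"}

    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            # Let shared caches serve this for a minute, so most requests
            # never reach the function and never see a cold start.
            "Cache-Control": "public, max-age=60",
        },
        "body": json.dumps(body),
    }
```

Even a short lifetime like this means a popular item is served from the edge most of the time, and only the occasional request pays the start-up penalty.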

Stop swimming upstream

All the effort in IT infrastructure is heading towards serverless functions, container orchestration, containers without actively running container hosts. The choice of hosted database or database-like storage services we are offered can make it confusing to decide. The answer is almost never “I’ll run something myself”.

Shunning these modern hosting options because you genuinely feel that your service is so special is choosing to take the hard path for little reason, in nearly all cases. And someone else will use them, have the advantage of working far more on functional code and far less on overheads, and could offer a cheaper/better product than you.

Yes, I know when you are at the scale of one of the top ten internet giants it can make sense – Dropbox moved their storage to their own appliances, but you’re not really Dropbox, are you?

AWS Launches MediaConnect and almost gives us multicast

It’s re:Invent time, and Amazon have launched a new service to make video routing to the cloud reliable and easier to set up.

A few weeks back I was at the brilliant DPP Leaders Summit, it was under the Chatham House Rule.1 There were some great speakers, and I particularly loved the exec who said, to paraphrase, “If it doesn’t work without months of professional-services, THEN IT ISN’T AN ACTUAL PRODUCT.”2

Anyway one of the speakers was facing rebuilding their entire stack due to ownership changes, and wanted to do so in the cloud. They said “We need multicast and Precision Time Protocol”. Which I can understand: for playout or production applications, the need for those two is pretty clear.

It’s now re:Invent season, which is the point in the year when AWS tend to release a lot of their good stuff. And yesterday they unveiled a new media ingest service, AWS Elemental MediaConnect.

It’s a managed service to get your video signals to/from/between your Amazon clouds.

This has historically been a pain: back when I was working on the Video Factory project we initially mooted a box in the cloud that we would send the signal to, and then that would fan out to both archiving and live streaming. This was hard to do, so we side-stepped the issue, and just rapidly uploaded the stream to S3 in consistently sized chunks instead. Later something was put in place to do the streaming, using something that I don’t think has been spoken about too much in public, so I shan’t detail here.

Anyway, this new service allows you to send content to/from an endpoint using standard RTP (with/without Forward Error Correction) or the more reliable but commercial Zixi protocol. The video has an Amazon ARN identifier, which means that external accounts can be given permission to subscribe to the stream. The documentation says a ‘flow’ can have up to 20 outputs.
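To picture the shape of that, here’s a toy model in Python. It is emphatically not the AWS API: the class, field names and protocol labels are invented for illustration, and the only figure taken from the documentation is the 20-output limit quoted above.

```python
from dataclasses import dataclass, field

MAX_OUTPUTS = 20  # limit quoted from the documentation at time of writing

# Hypothetical labels for the transports mentioned above.
PROTOCOLS = {"rtp", "rtp-fec", "zixi"}


@dataclass
class Flow:
    name: str
    source_protocol: str
    outputs: list = field(default_factory=list)
    subscribers: set = field(default_factory=set)

    def __post_init__(self):
        if self.source_protocol not in PROTOCOLS:
            raise ValueError(f"unknown protocol: {self.source_protocol}")

    def add_output(self, destination: str) -> None:
        # One source in, a capped number of destinations out.
        if len(self.outputs) >= MAX_OUTPUTS:
            raise ValueError(f"a flow can have at most {MAX_OUTPUTS} outputs")
        self.outputs.append(destination)

    def grant(self, account_id: str) -> None:
        # Another account is allowed to subscribe to this flow.
        self.subscribers.add(account_id)
```

The sender only ever knows about the flow; who is on the other side of it, and how many of them there are, is somebody else’s problem.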

How are we going to use this?

  1. Contribution to streaming output: fire the video somewhere and you don’t have to know if/where it’s being used
  2. Contribution for programming: using a few Amazon regions, broadcasters could very easily build a global contribution network to backhaul outside-broadcasts
  3. Contribution from a Playout appliance: if your cloud playout outputs to a MediaConnect flow, you can then output that flow to your broader distribution chain, allowing re-routing of things downstream.

It isn’t multicast within a VPC and it’s not PTP, and I suspect the latency involved may be too great to allow it to be used to route between different stages in a virtual playout chain3.

MediaConnect does however simplify integrating cloud processing workflows by providing fixed points at the edges in and out of the cloud.

I’ll be interested to see how people use it.

  1. That it is a singular rule is one of those bits of pedantry I cannot let go of
  2. This is probably a topic for another time, but the fact that so many enterprise vendors expect you to pay for their ‘product’ then explain that ‘oh, no, you can’t just use it out of the box even in a basic manner’ is a bit of a joke
  3. I could be very wrong here, I don’t have one of those hanging around to test