The Weaponisation of Resilience

(This post inspired by some posts I saw on LinkedIn, and some client experiences from many many years ago)

Resilience is a useful property.

We want it in many places: In our infrastructure from floods or traffic spikes, in our organisations from attacks by bad faith actors, and within ourselves from unexpected or exceptional events in our lives.

Now, in the last few years *waves hand* we’ve had quite a bit going on, and many of us have had to call on that resilience reserve more than usual.

Maybe as a result of that, or of a general awareness of mental health in the workplace, it’s now something that’s being taught to employees.

While I think that is a good thing, I think it has the potential to be weaponised.

A resilient worker, or more likely a team, has the skills/headroom/reserve to cope with a “once in an N” event, every “N”. So a “once in a week” exception every week, a once in a month exception every month, etc.

I worry that bad managers and teams will weaponise that resilience, and expect teams to be resilient against all events, even when they’re facing a ‘once a year’ event every month.

Exceptions aren’t avoidable. Things do change, go wrong, go better, go worse.

You can’t avoid all exceptions, and those are the ones that will draw on the “resilience reserve”. But if your team is constantly facing exceptions caused by poor coordination or planning, you’re wasting a valuable resource you should be keeping for elsewhere.

Giving your team the tools to be resilient is great, but you’re not giving them invincibility.

Is your software more important than you realise?

Software that isn’t “safety critical” can have real-world impacts.

If you’ve been working in IT as long as I have, you’ll maybe remember this wonderful example of legalese:

NOTE ON JAVA SUPPORT: THE SOFTWARE PRODUCT MAY CONTAIN SUPPORT FOR PROGRAMS WRITTEN IN JAVA. JAVA TECHNOLOGY IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED, OR INTENDED FOR USE OR RESALE AS ONLINE CONTROL EQUIPMENT IN HAZARDOUS ENVIRONMENTS REQUIRING FAIL-SAFE PERFORMANCE, SUCH AS IN THE OPERATION OF NUCLEAR FACILITIES, AIRCRAFT NAVIGATION OR COMMUNICATION SYSTEMS, AIR TRAFFIC CONTROL, DIRECT LIFE SUPPORT MACHINES, OR WEAPONS SYSTEMS, IN WHICH THE FAILURE OF JAVA TECHNOLOGY COULD LEAD DIRECTLY TO DEATH, PERSONAL INJURY, OR SEVERE PHYSICAL OR ENVIRONMENTAL DAMAGE.

Windows NT4 License agreement

It’s a pretty good example of where our minds tend to go when we think of “safety critical” systems. I tend to also think about things like complex train automation systems or the Therac-25 radiation therapy machine.

All things that are complicated, but are generally grounded in physical interactions with machinery, machinery that has high energy, or that interacts with other humans.

This came to mind because once again the Post Office Horizon Scandal, one of the biggest miscarriages of justice ever in the UK, is in the news.

If you weren’t aware, the system was buggy and could cause branches to show massive shortfalls, leaving postmasters with three options:

  • Make up the loss themselves, and hope the problem didn’t happen again
  • Report that their accounts balance, which was an act of fraud
  • Try to report it to the Post Office, which would be unhelpful at best, or begin an investigation at worst

The results of this were bleak:

  • People were wrongly convicted of fraud/stealing from the Post Office
  • People were wrongly imprisoned
  • Some people ended their lives in the immense shame of being someone who stole from their local community

In hindsight, that looks pretty safety critical… lives were materially changed, damaged, or extinguished.

What’s worse is that people from the software vendor and the Post Office claimed that the system was robust and that remote access wasn’t possible – at the same time as planning remote access to resolve issues caused by known bugs.

The latest BBC Radio 4 programme on this (after an amazing series) had an instance where a postmaster lost his branch due to these bugs, and a new owner bought the shop only to experience the same bugs. The helpline gave the same line, “Nobody else is reporting these problems”, which sounds highly unlikely to be true.

Sure, some senior people from the time have stepped down from their non-exec directorships.

In my view this is negligence at best: they should have done the due diligence to ascertain that the system was genuinely robust.

Software is everywhere.

Ovens have Wi-Fi, cars have highly complex computer vision, human bodies have attachments controlling insulin flows. People were given artificial eye implants to help them see, implants that the manufacturer no longer supports.

Whistleblowing is a painful and sacrificial act for the person who does it. But if we see people from our companies testifying in courts of law that “there are no problems with the software” (an impossible situation in all but the smallest of programs), we need to provide better ways to help this information surface.

Maybe if defence teams were better briefed, a statement like that could be countered with “No problems? Cool, we’ll verify that with an extract from your Jira instance” or “a third-party code review wouldn’t be a problem”?

I don’t know the solution.

I’m not a lawyer, I’m not an ethicist, I’m not someone who typically works in these kinds of environments – but I do know that lives were lost due to an accounting system being buggy.

And that doesn’t sit right with me.

The one in which I learned to build a chat bot, and a bit about how I learn

Why revisiting known problems can be a boring but reliable way to learn, and how to think about that for Hackdays

I’ve two problems I keep coming back to and implementing (or attempting to) every now and again.

First is my home-grown voicemail system, not that anyone actually makes standard telephone calls to me anymore, but after a series of “voicemail with dictation” providers left the market, I rolled something together with Twilio and App Engine.

The other is the data feed for the TfL London bike hire scheme. I have failed to buy myself a bike and still rely on the hire ones, and before app coverage was good, I had a webpage that could show me the docks around where I live.

These have both been revisited over the years. The voicemail system got moved from Google App Engine, where it returned a complete web page (remember those?), to running on AWS as a Single Page App: listening to an MQTT change feed and connecting to AWS services to retrieve data for the React app to update on screen.

Meanwhile, although the original Bike webpage went stale and stopped working, last year I started writing an application in SwiftUI that would display the data. Thanks to changes in iOS/Xcode that project can’t even run to get a screenshot anymore…

I’ve long wanted to be able to text something and get a reply about the bike docks nearby. Because I’m an iOS user, the Siri intents are very limited, and interacting with text messages is easier by voice – this would keep my eyes on the road.

So, after an idea just before bed, over the last few days I’ve created a chatbot, and I can ask it questions about the Hire bikes.

But why do I keep revisiting these problems?

TL;DR They’re boring problems that I understand the domain of.

I learn by doing.

My Professional Development Plan would be summarised as Continually Feeling Just A Little Out Of My Depth and managing to keep up.

It’s very much “I learn this stuff because I HAVE to learn it”.

Outside of that directed work frenzy, I have limited windows for learning – periods when I feel interested and able to commit some time to learning. I’ve found that I can generally learn 1 of 2 things:

  1. Approach: Something new about an existing domain I know: e.g. Using a new language, web framework or API to solve a problem I already broadly know.
  2. Domain: Learning new problem domains with existing technologies, e.g. building out a new website to use a new API, technologies I understand in areas I haven’t previously worked.

When I try to do both at once, I quickly get frustrated and quit. I’m not a full time coder, I architect systems and work with teams to coach them into building things, but I’m not best served building things myself, even though I love to work with those who are building.

When I’m working, the need to get things done powers me through any frustration walls (mostly). But when I’m doing stuff for ‘fun’, that doesn’t happen as much, so I try to only do one of the two things.

So, What does BikeBot do and What Did I Learn?

I can ask BikeBot for the details of a specific hire station if I know the name, which is useful if I’m heading to a place I know well, and just want to get details of bikes or docks that are available.

I can also ask BikeBot for the dock nearest to a point of interest, for when I don’t know what the station is called.

Bikebot then returns the data from the TfL API, and is accessible over the phone with speech recognition or could be available by SMS if I configure an integration.
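The nearest-dock lookup is about the only real logic BikeBot needs outside the managed services. A minimal sketch in Python, with illustrative dock data standing in for a parsed TfL BikePoint response (the field names here are my own, not TfL’s):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_dock(docks, lat, lon):
    """Pick the dock closest to a point of interest."""
    return min(docks, key=lambda d: haversine_km(lat, lon, d["lat"], d["lon"]))

# Stand-in for a parsed BikePoint response (field names are mine, not TfL's).
docks = [
    {"name": "Hyde Park Corner", "lat": 51.5027, "lon": -0.1527, "bikes": 12},
    {"name": "Waterloo Station", "lat": 51.5031, "lon": -0.1132, "bikes": 7},
]
```

With the geocoded point from AWS Location, this is a one-line `min()` over the dock list – the hard parts (speech, intent matching, geocoding) are all handled by the managed services.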

So, by revisiting a familiar problem, I got to learn:

  • AWS Lex, the chatbot tool
  • AWS Location, the tool I use to geocode “I’m at this Place give me dock”
  • AWS Connect, the contact-centre product that makes it accessible to voice over the phone

I now have a neat demo, which with a little more work provides me something I can use myself.

I also re-learned just how easy managed services can make solving problems, because so much of the heavy lifting is done for you. I would never find the time to do fuzzy matching of station names to user input, but if I give Lex that list, it’s done for me. Not perfectly, but orders of magnitude better and faster than I’d ever manage myself.

“erm, cool story bruh, but how does this help me?”

If you’re running hackdays, think about how many ideas the teams have, and whether the teams are capable of learning both 1 & 2 at the same time.

Some companies frown on “doing things to do with the day job” on hackdays, really wanting more Blue Sky Out There things, but maybe your team aren’t really up for that. Or if they are, they need a bit more planning, so teams and ideas are kind of sketched out ahead of the hackday, along with any of the pre-requisites to make progress quicker.

Monoliths or Microservices: how about a middle way?

Should we deploy as microservices or monoliths? How about neither.

The latest argument we’re having (again) is how we should deploy our systems: “microservices” or “monolith”?

Now, I’ll try to skip past what we mean by all of those things (because it’s covered better elsewhere), but in essence, we’re asking “does our software live in 80 repos or 1 repo?”.

TL;DR How about we aim for 8?

What does good deployment/development look like

In an ideal world, we’d have the following properties in our deployment:

  • It would have appropriate tests and automation, so deployment is easy and doesn’t feel risky
  • The potential impact of a deployment should be predictable: something over there shouldn’t impact something over here
  • It should be clear where to look to change code

Problems with Micro-services

  • If you go properly granular, it can be difficult to know which repo code resides in – if you pick up a ticket, you shouldn’t need to spend 15 minutes identifying where that code lives
  • Deployment of related services may need to be coordinated more closely than you’d like, ensuring that downstream components are ready to accept any new messages/API calls when they arrive
  • Setting up deployment for each new component can be time-consuming (Although with things like CDK/Terraform etc, it should be possible to template much of this to a config file for the deployment system)
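That templating point is worth dwelling on: if each component only declares what differs from the defaults, the cost of a new deployment pipeline shrinks to one config entry. A hedged sketch in Python – the component names and fields are purely illustrative, not any particular deployment system:

```python
# Hypothetical per-repo config: each component declares only what
# differs from the defaults; the pipeline machinery fills in the rest.
DEFAULTS = {"region": "eu-west-1", "memory_mb": 256, "timeout_s": 30}

COMPONENTS = {
    "order-api": {"runtime": "python3.12", "memory_mb": 512},
    "order-worker": {"runtime": "python3.12"},  # inherits default memory
}

def render_pipeline(name, overrides):
    """Expand one config entry into a full deployment spec."""
    spec = {**DEFAULTS, **overrides, "name": name}
    spec["stack_name"] = f"{name}-stack"  # convention over configuration
    return spec

pipelines = [render_pipeline(n, o) for n, o in COMPONENTS.items()]
```

The same shape works whether the specs are fed into CDK, Terraform, or anything else – adding component number 9 is a two-line diff, not an afternoon of pipeline setup.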

Problems with Monoliths

  • Code can potentially leak into production more easily – requiring more robust feature-flagging to hide non-live code. While this is good practice, it becomes a requirement in larger repos – you can avoid this by ‘dev’ deployments not being in trunk, but that’s a different kind of deployment complexity
  • Spinning up another instance of “the system” for testing of a single component may be more expensive and fiddlier than duplicating an individual component
  • The impact of a deployment may not be known, you may need to assess if other commits included in what you’re putting live could break things, this may increase deployment friction

How about Service Cluster Deployments

In a prior engagement, we built what was really a task management system.

  1. Messages would arrive which could potentially make new tasks for the system, or update existing tasks: these were handled by the task-creator-and-updater
  2. The task-viewer would access the database of tasks, cross reference with other services, and create a unified view of the task list
  3. An automation component would use the output of the task-viewer to initiate actions to resolve the tasks, which would ultimately result in more messages arriving, which then updated the task database

In our deployment, these components were all in 3 different projects – the microservice model. And it worked, but it’s also an example of where these 3 components could be combined into one functional service repo.

This makes sense to me because the 3 services are closely coupled, especially the task-creator-and-updater and the task-viewer. So maybe they could have been in a combined repo, task-management.

With this setup I could still feel safe doing a deployment on a Friday afternoon to one component, because even if the task management system failed entirely, the manual processes were in place to allow recovery until the system could be rolled back.

Meanwhile, for another of our components, the cost of a failed deployment was so high – and the system so time-critical even if it was recovered – that we only deployed during ‘off-peak’ periods of the week. Could it have been made more robust? Probably, but it was also a relatively static system – that effort was better spent on components that were more ‘active’.

In summary

Your deployment should work for your team. It should be based on templated conventions that allow easy configuration of new deployments, and it should be as granular as makes sense.

Instead of worrying about being “truly micro-services” or “fire & forget monolith” find the smallest number of functional groups to keep your code in. That way you can have scope-limited deployments, without having hundreds of repos.

Finally, please, just-name-your-repos-like-this, it’s funny at first giving things amusing names, but honestly, kitchen-cooking-oven is far more supportable than the-name-of-a-dragon because it gets really hot.

Things that it was good we learned this year

Rather than a “hey, look at the good that actually happened in 2020”, I’d rather take a more realistic view: there are a whole bunch of things that are very good for us to have learned or been reminded of – but that aren’t necessarily good in themselves.

Medical Science can do amazing things, when we want it to

  • The virus was genetically sequenced within a month of it being discovered
  • The speed at which we got a whole range of vaccines/vaccine approaches into trials was incredible – that many of the vaccines were made without ever having been near the virus itself is remarkable
  • RNA vaccines could revolutionise things – we’ve shown efficacy and some have suggested that if we have vaccine templates in place for major families of pathogens we could have the ability to rapidly respond to the next pandemic
  • Clinical trials, such as the UK “Recovery” trial found existing drugs that have improved survival rates

Turns out we need society and collective action

  • Masks help other people more than they help you… without that sense of shared responsibility their use won’t be consistent
  • Local Mutual-Aid groups popped up around communities, and people helped those shielding or isolating

The situation has codified and made visible inequalities

  • People with jobs where they can comfortably work from home, have been shielded from a lot of the worst of the situation
  • They can likely afford to buy delivery food and order online – effectively outsourcing their risk to someone else
  • Freed from that casualisation of employment, they have far less concern if asked to isolate
  • Contactless payment everywhere is great for those of us who hate carrying cash, but less good for the under-banked who find it hard or impossible to get a contactless card or account

The casualisation of employment hurts us all

  • Casual workers tend to move around between clients and sites more, which in the case of Melbourne’s Aged Care facilities, meant that they were in a prime position to move the infection between facilities
  • Casual workers also can’t sit at home as easily waiting for a test results because they have bills to pay, or to isolate when they’ve had a contact.

Not all Key-Workers wear a fancy uniform

  • Health service employees have had a terrible time, working in stressed conditions, seeing more death than any of them signed up for. But they did sign up to work for a critical function in society.
  • The same can’t be said for people who work on the tills in the supermarket. We discovered that food-retail is pretty important – but until now those people didn’t get any of the cachet for that role
  • Similarly delivery drivers, postmen, transport workers… the list goes on, but there are numerous roles, that are occasionally looked down on, that turn out to be vital

Food supply chains maybe don’t need to be quite so lean

I love a just-in-time manufacturing supply chain as much as any logistics nerd, but while it makes sense for car-parts, it maybe doesn’t make sense for everything

  • Food systems adapted, but initially they were shown wanting for supplies (interestingly packaging rather than the food goods in some cases)

5 days a week in the office is dead, remote learning can be a thing

  • Turns out that most jobs can largely be done at home – knowing this might help people less able to travel due to health conditions get work
  • 5 days a week at home isn’t great though: it can be lonely, cramped, maybe not possible if you’re in a shared home. I suspect we’ll end up with most people doing 2-3 days a week in an office and the rest remote
  • Local co-working spaces may well pop up – many people who want to get out of the house may not need a commute, they just need a comfortable place to work where they can see people, just not their colleagues
  • Many disabled people have tweeted, in justified disgust, that they asked for remote learning for years, and were told “it wasn’t possible”. Turns out that lecturers can load Teams when able-bodied people need that facility

The Economy needs to adjust to the end of Big City Centre Offices

  • Our economy is too service sector heavy: Jobs have been lost because people aren’t coming into city-centres anymore. I sympathise with those employees & unlike a govt campaign (actually from months before the pandemic) not all of those people can retrain in cyber
  • Despite this hurt, and what politicians/commercial property owners say in press op-eds: we are not obliged to support the service-sector in the city centre as a patriotic duty
  • Some of those jobs may be relocated to places nearer where people are now, (personally I’ve still bought far too many takeaway meals while WFH)
  • Going forward, do we need more mixed-use neighbourhoods like you see in mainland Europe, driving the demand for services rather than being driven by office-workers?

The NHS

To those people working in the NHS: I think we all thank you, and hopefully we’ll find ways to do it that go beyond clapping

  • While no health service truly coped with this, years of recruitment shortfall have exacerbated difficulties in the NHS response
  • While it was impressive how quickly we built the Nightingale hospitals, because admitting hospitals needed to provide staffing and equipment for the patients, it made little sense for them to be used, resulting in them becoming white elephants
  • When freed from arbitrary government targets, the NHS can radically reconfigure itself when it needs to.

Personal Protective Equipment (PPE)

  • The entire world shouldn’t outsource production of PPE to one area in China, that happened to be hard hit by the outbreak – we should have some on-shore production that could be ramped up at times of crisis
  • PPE Management shouldn’t be given to random companies – this is a UK specific one, but we love outsourcing things, and sometimes you can’t just treat specialist things as commodities
  • Those PPE stockpiles should be actively managed – rather than building up emergency stockpiles – we should always be taking stuff out of them and replenishing, that way we don’t end up with years old PPE that needs to be re-certified for use, which doesn’t instil confidence

UK Government Response

  • The Prime Minister undermining the response before it even began, when the government message was “Don’t go home for Mother’s Day” and the PM was quoted “I’m still hoping to see my Mum” wasn’t a good look. Nor was sanctioning his advisor for taking a unique eye-test in Durham
  • Any effective pandemic response always feels excessive, because it’s done before it feels needed. Having a PM who is unable to make decisions before his hand is forced, doesn’t work so well
  • The UK needs to give up its obsession with throwing problems at generic business process outsourcing companies – they do a shit job time and time again, and have taken far too much money for a sub-standard response
  • Continuing the above, local contact-tracing teams work better than centralised teams – a result replicated in other places, not just the UK
  • After a brief period of IT improvement through GDS, the UK can once again let poor IT processes waste money with private IT companies: just look at the money spent on the initial Bluetooth app that everyone who understood iOS restrictions told them wouldn’t work
  • Worse was the failed Excel handover mechanism that cost lives because contacts weren’t followed up. I know that people do things at haste in situations like that… but still

We need to address online-misinformation

  • There have always been contrarian conspiracy people, but without the distribution channels of social media their impact was limited: now their actions undermine trust in the response, vaccines, etc.

Can we stop with the (inappropriate) gatekeeping?

It’s another week, so it’s time for everyone’s favourite game: Gatekeeping.

In particular this example: Chloe (a Senior Developer Advocate for Microsoft who does some cool stuff with code, while putting up with being a woman in tech on Twitter) posted this:

Now, there are a whole variety of reasons for this being a good thing. There’s evidence that diverse teams, while sometimes worse at repetitive/samey tasks than less diverse teams, do better when thrown new problems.

Also, having people who aren’t white comp-sci males on a team leads to picking up on things, like an awareness of how your product might be misused. Abusers have used Venmo to send money to their victims, because “why would you want to stop someone sending you money?”.

Of course, a man was here to quibble advise:

Now, machine-learning is an interesting discipline to pop up and claim that inexperienced people aren’t going to do a good job… we’ll go into that in a second.

Yes, it’s probably true that someone starting out will not be able to generate an entirely new model. But will they be able to follow tutorials and train one of the existing models? Likely yes.

Will they be able to replicate the many mistakes that ‘pro-fess-ion-al’ machine learning engineers have made? Absolutely.

Machine learning has been used to codify our biases. Facial recognition performs worse on non-white faces… “flight risk assessment algorithms” which are commercially sensitive so can’t be audited, seem to report that certain communities are more of a risk.

Meanwhile, there was that time a “cancer detection” model had actually trained itself to detect the different colours of slide-frames used between control and malignant samples.

I’m just saying, that maybe Machine Learning isn’t yet the rigorous pillar of integrity and correctness that needs protection to preserve its pureness.

“React is for n00bs”

This is another good one.

When new devs start out and they use React, a variety of callouts appear:

  • “It’s too complicated, they need to learn the basics”
  • “React is too heavy, they need to learn to optimise”
  • “the amount of javascript we use on the web is too high and a security risk”
  • “if you don’t learn the basics of DOM manipulation how can you possibly do it well”
  • Server-side rendering of client-side apps is just a return to the old way
  • We shouldn’t be building apps on the web

Most of these are true to a greater or lesser extent, but you know what else is true?

This is what the web looks like now…

It is not where any of us would probably start, but it’s where we are.

Having architected a business system that uses React as the UI, that system would have been painfully unusable if every interaction was a page load and form reload… modal popups and API calls made it a better experience for users.

“They’re building unoptimised systems and that’s not good”

That is also true, however how do you learn to build an optimised system?

You ship something that gets to the point it needs to be optimised. Many systems never do… Good enough is, well, good enough.

These things are analogous to scaling problems: if you get them, they’re nice to have.

We do want some gatekeeping

I don’t want a newbie coder to write the control software for a nuclear reactor… This is unlikely

But more realistically, the area where we need to find ways to help new programmers is with the basics of security.

I don’t want a newbie writing a user registration system – there are plenty of managed Identity Providers (IDP) out there like Auth0, Cognito, AzureAD, Login with Google, Login with Apple etc…

So yes, I wouldn’t want a newbie writing an IDP of any complexity – I can see them storing passwords in cleartext in a MySQL database.

But we don’t talk about these things, or how we can give new programmers an intro to the “easy” 80% of security things: basic security on APIs, not storing secrets in your app, not using sequential/predictable IDs around the place.
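Some of that 80% really is easy to teach. Avoiding sequential IDs, for instance, is a one-liner with Python’s standard library (the function name here is mine, purely for illustration):

```python
import secrets

def new_public_id() -> str:
    """Return an unguessable, URL-safe identifier.

    Unlike an auto-incrementing integer, knowing one ID tells you
    nothing about any other ID in the system.
    """
    return secrets.token_urlsafe(16)  # 128 bits of randomness
```

Swap that in wherever an ID is exposed in a URL or API response, and the classic “change the number in the address bar” class of bug disappears – the kind of lesson that fits in a ten-minute intro.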

Instead, the message is a much vaguer “go and learn enough before we deem you WORTHY of writing for the web”.

Some people learn by doing a CompSci degree. I have one of those.

While it taught me a bunch of formal things, so much of what I’ve learned is by working with good people, making mistakes, and learning more.

I learned React in part because I was working with a bunch of coders who were learning it… As an old-school HTML, JS, jQuery & CSS person, I was initially confused and scared of it. Then create-react-app appeared and I finally got it.

If we don’t turn down this obsession with gatekeeping entry, we don’t let new people learn.

We end up with the same faces, and products will be worse for everyone. Us older-school people will get stale, stagnate, and just write the same stuff until we get retired.

We can nudge better than with streaks…

The brittleness of breaking a streak needs to be broken somehow.

So yesterday I had my phone swiped by someone on a bicycle. This was thankfully one of the few times I’ve experienced crime in my life, it was non-violent, and the phone has been remote wiped.

This is annoying, leaves me a little shaken, and also annoyed that the thing will likely be dismantled for parts. It’s locked, and while I know that is imperfect, it’s work to remove. But I have insurance, so it’s all good mostly ok.

But, frustratingly because my phone was gone, I just lost my 205 day Activity streak from my watch.

This was slightly annoying because one bit of Apple did know I’d moved, the “with friends” feature, but I suspect that is just sharing 6 numbers every few hours (the target and amount done for the 3 categories). This isn’t the “health” database, which is a far more granular time-series database, so the streak is broken.

Adding insult to injury theft, the watch even congratulated me for hitting my goal and extending my streak; unaware the achievement would disappear into the electronic void.

And this is the problem with these things: my mind (at least) goes to a place of “getting back to 205 days will be hard, it’s winter”. It now becomes a bit “why bother” rather than an incentive.

The same thing applies to Duolingo, Headspace, etc. When it’s a daily thing and you fall off the wagon, it feels difficult to get back on. I know it might not be as effective, but what if I want to do 3 days a week of each – my 10-minute self-improvement slot?

David Smith’s MyTrends++ has the concept of rest days: if you get 7 days on, you’re allowed 1 day off. I think that works nicely.
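That rest-day rule is simple enough to sketch. Here’s my guess at how it might work, in Python – the exact rules of the real app may well differ:

```python
def streak_with_rest_days(days, earn_every=7):
    """Count a streak over booleans (True = goal met that day).

    Every `earn_every` consecutive active days banks one rest day,
    which absorbs a single missed day instead of resetting the
    streak to zero. (A guess at the rule, not the app's actual logic.)
    """
    streak, banked = 0, 0
    for met in days:
        if met:
            streak += 1
            if streak % earn_every == 0:
                banked += 1
        elif banked > 0:
            banked -= 1  # spend a rest day; the streak survives
        else:
            streak = 0
    return streak
```

Seven active days bank one rest day, so a single miss afterwards doesn’t reset the count – which is exactly the forgiveness the all-or-nothing daily streak lacks.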

However, these things also don’t account for getting ill for a few days and being unable to exercise. I hate to cite the UK privatised rail system, but they are (or at least were) allowed to declare void days, when evaluations didn’t count – I guess, charitably, you could say these were fair when absolutely awful weather hit.

If you’re ill and not feeling able to exercise for a few days, that’s (probably) not your fault. But I think, paradoxically, the bigger the streak the bigger the “I’ll never get there again” feelings that arise.

Amazon’s new Halo wearable judges you over a week (if I remember correctly from a podcast), rather than daily, which I think is more sensible and understanding of your life.

I don’t know how to fix this as I’m not a behavioural psychologist, but I wonder if you should be earning the equivalent of “long service leave” – maybe you earn one cheat-day a month, up to a maximum of 5?

I don’t know the solution to this, but I think the current obsession with daily gamification isn’t really that great, and I know we can do better.

A feature request for LinkedIn

I like LinkedIn, but I would love if I could make recruitment messages more relevant.

I’m about to whine about recruitment, which I understand isn’t great when many good people are looking for work.

If you can do anything to help people in your network, recommendations, connecting people up – now is the time to lower your reputation-risk considerations (what if they aren’t a match, aren’t good) and do it anyway.

Although I dislike the Storification of LinkedIn, and find “Heart Warming Stories of Dubious Origin, About That One Time Someone Showed Basic Human Empathy” posts a little grating, I like LinkedIn.

I primarily work in the Media & Entertainment industry, and very often people move around. One time I was working with a team re-engineering a high-profile transcode stack, and we needed to check compatibility for one consumer whose very Fussy Set-Top Boxes needed specific H264 encoding parameters.

Searching on LinkedIn found that someone I’d previously worked with was now there, and that was one of those useful back-channels that actually get the work done, alongside the formal ones where detail is invariably lost in all the mediation layers.

I’ve previously found work through LinkedIn also, people in my network were looking and we had chats…

In both of these cases it was a route to contact people who I likely wouldn’t have managed otherwise.

The Bit Where I Bitch About Recruiters

While I know #NotAllRecruiters, many are somewhat annoying.

I’m quite specific in my profile intro of the kind of roles I’m open to, and still I get requests to be a: Permanent, SAP, Project Manager, in Bracknell.

That’s one technology I’ve never worked with (merely around) and 3 job qualities that I will avoid.

Tiresome for everyone, a waste of my time to read and theirs to send.

The over-engineered solution

As mentioned, I’ve a number of relatively simple conditions about jobs I’ll consider.

One time I got a message about a job that was “Only for Oxbridge graduates, but Imperial is also OK” – I know this was meant to be flattering and give the impression of an intellectual workplace (while also being a bit negging that “Imperial was almost good enough”). However, it just screamed of a horrendously toxic culture with Platinum Grade Gatekeeping.

So if you’re specific about what you’re looking for, why don’t you get to state that in some questions? And when a recruiter who isn’t in your network wants to contact you, how about they’re given a page like this… (please excuse the 💩 mock)

[Mockup: a list of questions a recruiter might face – is the position permanent or contract? Does it use appropriate technologies? What is the salary?]

Actually Maybe This Is an Application for ML…

As I was writing this (helpfully after doing the 💩 mockup), I thought of a much better solution: If you can choose from a smaller range of criteria – and ones that could be detected by an ML classifier – LinkedIn could just run the classifiers you care about on an “out of network” message.

The score of the message could then drive a traffic light system: the message is accepted, outright denied, or, if borderline, the sender needs to click a “Yes, it’s appropriate and your classifier is wrong, scout’s honour, promise” button.
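The traffic-light idea could be sketched like this. This is a hypothetical sketch only – the criteria, thresholds, and the `triage` function are all invented here, not any real LinkedIn feature, and the per-criterion scores stand in for whatever the classifiers would output:

```typescript
// Hypothetical sketch of the traffic-light triage.
// Each field holds an invented classifier score in [0, 1] for one criterion.
type Verdict = "accept" | "deny" | "borderline";

interface MessageScores {
  mentionsContractRole: number; // e.g. contract vs permanent
  matchesTechnologies: number;  // e.g. not SAP
  statesSalary: number;         // rate/salary actually stated
}

// Average the per-criterion scores and map onto the traffic light.
function triage(scores: MessageScores): Verdict {
  const avg =
    (scores.mentionsContractRole +
      scores.matchesTechnologies +
      scores.statesSalary) / 3;
  if (avg >= 0.7) return "accept";   // clearly relevant: deliver it
  if (avg < 0.3) return "deny";      // clearly irrelevant: block it
  return "borderline";               // sender must click the promise button
}
```

A message scoring well on all criteria sails through; one that matches nothing is blocked; anything in between triggers the “scout’s honour” confirmation.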

Would it work?

Unless there was a penalty for clicking “This Isn’t Spam” I doubt it would.

I also suspect it would hurt LinkedIn’s revenue too much if, having paid for Gold Premium Ultra, people aren’t able to send messages.

To the good recruiters, who like great project managers are rare but invaluable – I’m sorry.

To the rest of you, I’m just not ready to do SAP in Bracknell.

Prescriptive Software Practices: Code Re-use Edition

Individual software practices don’t exist in a vacuum, and need to be viewed collectively.

Today I saw this tweet, which I initially violently agreed with, before realising the answer is really more “it depends”.

Now I fully agree that demanding people write the abstraction layer before they’ve even written the first component to use the underlying tool is a folly that leads to bad libraries. You don’t know how to best use the underlying API, you don’t know how you want to use it, and you don’t know which of the methods you want to wrap or enhance.

The requirement to wrap every ‘method’ is the main reason I dislike intermediate libraries. One time I asked “are we using this new AWS feature that’s perfect for our use case?” The answer: “No, we can’t, because Terraform doesn’t support it yet.”

Any time you put something in-between you and the underlying service you’re introducing a potential roadblock. I’ll explain later how I think you can minimise this.

The main reason I think code-reuse/libraries are hard to get right is a conflict at the core of them:

  • A trivial library can be simple to use, but if the functionality is simple, what is it really adding?
  • A feature-filled library is usually (but not always) harder to make use of, and if most people only use a fraction of it, what makes it worth the overhead?

Things don’t exist in isolation…

Warning, inbound analogy: very often “we” like to look to other countries and cite how wonderfully they do a thing. An example from the UK is that we’re often told that “people shouldn’t mind renting flats, because in Germany people tend to buy later.”

Which sounds great, but when you point out that Germany has a bunch of related things – longer leases, more freedom to decorate/change properties, and consistent house-building that keeps house-price rises far more modest – people tend to go quiet.

Returning to software, everything is similarly related and supported by other practices. If you don’t fully understand a problem, you can’t cleanly decompose it into a sensible collection of services, and only when you’ve done that will sensible opportunities for code re-use/libraries emerge. (At this point you’re welcome to argue that if you’ve decomposed your system properly then you shouldn’t need to reimplement functions.)

XP/Agile/Clean Code/BDD/TDD/… can become quasi-religious in how much you must adhere to all of their tenets. I suspect very few people are fully compliant with any one tribe, and to be effective as teams you need to view things as recommendations or possibilities, not commandments that thou shalt obey.

How to do code re-use right…

This is just my experience, but a few questions to ask or points that I’ve found have worked for the people I’ve worked with in the past:

  • Avoid needing them in the first place: if your transaction volume is low enough, just have a dedicated service that does the particular thing… A single point of truth is the easiest way, but that isn’t always possible due to latency or cost concerns
  • Consider Security/Auth/Data-protection first: These are things that you need to create decent libraries/patterns for, because if the easiest thing is the right thing, you’re going to be making fewer critical mistakes, and it can make patching easier if you’re exposing a consistent interface but have to update an underlying library with breaking changes
  • Judge the demand: while many times people will go “wow, I didn’t realise I needed x until it appeared”, unless it’s really obvious that lots of people have the exact problem, do you really need to write a library?
  • Understand it before you abstract it: Don’t write them first. My ideal preference is that when you have a few teams working in the domain, let them create distinct implementations. Afterwards, regroup and use that learning as the basis for a library. This is more work, but the result will be much better
  • Keep the library fresh: Is it one person’s job? Is it a collective whole-team effort? A library needs to be a living thing as the systems it interacts with will change. Developers will rightly shy away from using a clunky piece of abandoned code
  • Layer up in blocks: a client has a back-end system with specific authentication requirements and has been building out client libraries. There are 3 distinct libraries: connection management, request building and result parsing. You don’t have to use all of these, and can just use the connection library if you want more direct access
  • Make your library defensive but permissive: TypeScript has got me back into typing, but previous experience makes me nervous. In micro-services environments a library update can require many unrelated deployments, when only two components are functionally involved. Errors because enums aren’t valid can be useful, but can you expose the error when that property is accessed rather than when the payload is parsed?
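The last point – deferring validation until access – could look something like this in TypeScript. A minimal, hypothetical sketch: the `Status` values and `Account` shape are invented for illustration, not taken from any real library:

```typescript
// Hypothetical sketch: defer enum validation until the field is read,
// so a payload with one unknown value doesn't fail wholesale at parse time.
const KNOWN_STATUSES = ["active", "suspended", "deleted"] as const;
type Status = (typeof KNOWN_STATUSES)[number];

interface RawAccount {
  id: string;
  status: string; // may contain values this library version doesn't know yet
}

class Account {
  constructor(private raw: RawAccount) {}

  get id(): string {
    return this.raw.id;
  }

  // Validation happens here, on access, not during parsing.
  get status(): Status {
    if (!(KNOWN_STATUSES as readonly string[]).includes(this.raw.status)) {
      throw new Error(`Unknown status "${this.raw.status}" on account ${this.raw.id}`);
    }
    return this.raw.status as Status;
  }
}
```

A consumer that never reads `status` keeps working even if the producer starts emitting a new status value before this library is updated – only the two components that actually care about the field need redeploying.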

In summary…

Teams need to find their own path, and find where on the line between “Don’t Repeat Yourself” and “Just Copy-Paste/Do Your Own Thing” they lie. It is highly unlikely to be at either extreme.

“It Depends” isn’t a particularly enlightening answer, but like so many things about building decent products, it is what it is.

On THAT Excel Issue

What can we actually learn from the government’s Excel-related issues?

There have been many comments posted in the last week about “excelgate”, or whatever we want to call a life-threatening data exchange problem. This post is not about absolving the government of blame for this, or the countless failings they’ve made across the Test & Trace programme. Between the app that everyone who understood iOS Bluetooth told them wouldn’t work, and giving the bulk of Contact Tracing to private companies rather than local health teams… I’m really not excusing them.

But I do think there are more nuanced lessons to be learned beyond “LOL WOT THESE N00B5 USING M$ EXCEL. Y U NO PYTHON?”, which is an exaggeration, but not by much, of some of what I’ve seen online.

I’m writing this based on the following assumptions/guesses: data had to get from one system to another, .xls rather than .xlsx was used, and this hit a row limit. (This really should have been an automated feed, but that’s not what I want to explore here; I want to explore how organisations can prevent people doing ‘good’ things.)

So, we’re using an inappropriate data transfer format, with a hard limit on how many rows it can contain… This sets up a few different scenarios:

  • Nobody foresaw this problem
  • The problem was known, but the decision was taken not to fix it
  • It was known, people wanted to fix it, but couldn’t

If we explore these, I think there’s some learning we can take away for the organisations we work for or with, about how some of our anti-patterns might put us into similar scenarios.

Nobody Foresaw This

This would be the most damning of the outcomes: a risk that nobody had realised they were living with, and crucially one that the software doing the export didn’t warn about.

Tips to avoid it:

  • To borrow from the WHO: Testers. Testers. Testers. Hire decent testers, the ones who infuriate you with “What if this series of 3 highly improbable events happens?”
  • As we’ll come onto in a second, listen to them when they say these things.

It was known about, but decisions were taken not to fix.

These aren’t fun. As someone who once predicted a particularly nasty auto-scaling bug and tried to warn people, only for the fix to be accepted after the bug actually occurred, I know it can always leave you feeling “if only you’d argued the case better”.

But it’s legacy…

Matt Hancock, the UK Health & Social Care Secretary, described the system as (paraphrased) “Legacy and being replaced”.

We’ve all been here: a system that is old and being replaced is considered frozen because “it’s going away”. However, I know of systems that were due for replacement in the next 6 months, but 3+ years later development hadn’t started. This was used as a reason not to do relatively trivial UX changes that could have been a great improvement for the operators.

Tips to avoid:

  • Until you unplug the server, turn off the instance or stop new data flowing into it, no system is “legacy”

“It’s very unlikely… we can live with it”

Nobody, apart from epidemiologists and software billionaires, predicted a future epidemic on this scale – so I guess that maybe the problem was known, and the decision was taken to live with it. Going back to the first recommendation and hiring a tester: sometimes so many scenarios are found that it’s easy to tune out, because like Cassandra, the tester is always talking about problems.

Tips to avoid:

  • It’s ok not to fix everything, but if you’re living with a risk, make sure it’s known, and doesn’t fail silently.
  • Keep it in your risk log, and actually re-read that once a quarter to assess whether any of the risks are now more of a problem.
  • Try to be a little less agile, at least in methodological purity, and go beyond “what we’re building next” and look a few steps ahead.
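The “doesn’t fail silently” point is the crux of the Excel story: the legacy .xls format caps a worksheet at 65,536 rows, and rows beyond that were simply dropped. A minimal sketch of a loud guard – the function name and surrounding export flow are hypothetical, only the row limit is a real property of the format:

```typescript
// The legacy .xls (BIFF8) format caps a worksheet at 65,536 rows.
const XLS_MAX_ROWS = 65_536;

// Hypothetical guard: refuse to export rather than silently truncate.
function assertFitsInXls(rows: unknown[]): void {
  if (rows.length > XLS_MAX_ROWS) {
    throw new Error(
      `Export has ${rows.length} rows but .xls holds at most ${XLS_MAX_ROWS}; ` +
      `use .xlsx or split the output`
    );
  }
}
```

One `if` statement like this, anywhere in the pipeline, turns a silent data loss into a visible failure that someone has to deal with.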

We wanted to fix it…

This is when we get into some of the most depressing collection of scenarios:

“You can’t just make a change, this needs a PROJECT”

Changes need to be properly developed, tested and deployed, but sometimes this doesn’t need a full project structure created. When all improvements are painful to implement, people just accept and build workarounds, some of which you may not be aware are in place.

Tips to avoid:

  • Have a lightweight process for “short-order” requests that are small.
  • Find ways to bundle these into bigger releases alongside the “im-por-tant” work.

“It’s too expensive”

If you have a bad contract with your supplier, it could just cost too much to viably fix.

Tips to avoid:

  • Only buy software/services where the API is included, and is nice to develop against (I’m looking at you, SOAP)
  • Have clear boundaries in your systems/components, own the integrations yourself, so you can swap components or combine as required

“The person who develops it is too busy/gone away”

You could imagine that even if this system were modifiable, right now the people with IT skills are probably elsewhere, working on the plethora of other systems that have had to be spun up to cope with the current situation.

Worse, though, is when software has gone stale: while you may have developers who could work on the problem, nobody really understands how to build/deploy it anymore, so it’s effectively stuck.

I’ve worked with clients who had problems with code going stale, and who instituted a very strict “if you modify a component you must fully adapt it to be in line with our current standards” rule to fix this. However, this just introduced a disincentive to make minor changes, because the developers knew that alongside 5 lines of functional code changes, they had to make 500 lines of dependency-related changes.

Tips to avoid this:

  • Avoid one product/system/component being solely one person’s ‘thing’.
  • Find ways to allow people to deploy minor changes as a BAU process, gradually updating components into modern ways of working without dogmatically requiring every component to be fully updated.

In conclusion

We’ve all used Excel files or CSVs in email, or a Google Sheet, as an interim solution. The problem is that these interim solutions become permanent, and eventually they stop working. I’m lucky in that mine were about keeping TV or VOD on-air, and not about life-or-death statistical reporting processes.

But still, let’s tone down the sneering “BUT WHY WASN’T IT AUTOMATED” talk. Yes, it clearly should have been, but none of us know the decisions being made, or the available software hooks that the operators/developers had access to.

Always monitor your systems, spot where things can be better and make the incremental improvements because they add up over time. Never invest all your hope in the new system/rewrite because they’re always years away, and usually come with their own new ‘quirks’.