Complete means complete…

Quick rant here, but I’m in the process of leaving a service that requires some careful presentation of the data/documents for regulatory type purposes.

Fine, I think, they tell me to head to the download page, and download my data.

Use this to get a complete export of all your XXXX data or restrict by date. The export includes all accompanying receipt attachments and PDF documents where relevant.

If you need to restrict the results further you can do this by using the filters on the relevant list and running the export from there.

That sounds pretty definitive, great, it’ll be a big zip file with everything.

“Complete export” is even in bold, that shows intent.

The zip file expands into a folder with the word “complete” in it.

BUT, as you’ve no doubt guessed from all the signposting, it is not complete. There are a bunch of things not included.

It’s not stated what things aren’t included, because you know, that would be too easy.

TL;DR: if you use the word ‘complete’ in your data extraction, include everything.

At the very least tell me what’s missing.

Good to have it confirmed why I’m leaving this service.

State-Issued Identifiers aren’t generally good passwords

Just because a number is long, doesn’t mean that it’s secure…

For many years, despite repeated requests, a South African bank has been sending me bank statements.

The thing is that I don’t live in South Africa, have never visited, nor banked there. But I do have a particularly “good” email address that gets a lot of misdirected emails… I usually don’t read them, but this week I did, as I replied with my latest attempt at “please stop”.

The email included a PDF statement, password protected with the South African ID number of the customer. I suspect that protection is why the service centre seems so unbothered by the repeated requests to stop sending this information.

A few years back, I got very embedded in PII/GDPR. We were designing a data warehouse setup that allowed analysis of user activity, while protecting privacy, and enabling easier compliance with GDPR deletion requests. There was discussion about the feasibility of SHA2 hash reversal… and we took a lot of time to communicate the infeasibility with the legal team.

So this week, I started to wonder: If I were a bad actor (which I am not), could I feasibly crack this ID with some Python?

What’s the challenge?

There are three sets to consider in this:

  • The Absolute Keyspace: without any knowledge of the identifiers, the number of combinations?
  • The Available Keyspace: if the ID has constraints, how many valid combinations are there?
  • The Effective Keyspace: if we know anything more, how many combinations are applicable?

A good security identifier should make discerning the differences between these difficult: it should involve a reasonable amount of calculation before it becomes obvious that an ID is valid, and correct.

What do we know?

South African ID Numbers are 13 decimal digits.

A single decimal digit can be 0-9, ten values in total, so each digit has a cardinality of 10.

(Cardinality being the maths word for ‘number of choices’ in a set)

We can use this information to calculate the absolute keyspace:

Keyspace_{Absolute} = &10 \times 10 \times 10 \times 10 \times 10 \\
& \times 10 \times 10 \times 10 \times 10 \\
& \times 10 \times 10 \times 10 \times 10  \\
 = & 10^{13}\\ 
 = & 10,000,000,000,000

Since this work was derived in a Jupyter notebook, I’ll also include some python as we go along…

Python Code for absolute keyspace

Number of combinations: 10,000,000,000,000
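The notebook cell itself isn’t reproduced in this post. A minimal sketch of the calculation that would give the figure above (the variable names are illustrative, not the original notebook’s):

```python
# Absolute keyspace: 13 positions, each of which can be any decimal digit.
DIGITS = 13        # length of a South African ID number
CARDINALITY = 10   # choices per digit (0-9)

keyspace_absolute = CARDINALITY ** DIGITS
print(f"Number of combinations: {keyspace_absolute:,}")
```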

Is this feasible?

So without knowing anything about the ID, there are 10 trillion combinations to check.

Python can attempt to open a PDF with a password around 150 times per second. This would be our basic implementation.

More specialised tools like John the Ripper raise that rate to 4,500 per second, which is 30 times faster.

We’ll put these into a summary function, as we’ll be calling this a few times.

Python for Summary Function

At slow rate: 18,518,518.52 hours, or: 771,604.94 days
At optimised: 617,283.95 hours, or: 25,720.165 days
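A summary function along these lines could look like the sketch below. The two rates are the ones quoted above; the function names are assumptions made for illustration.

```python
SLOW_RATE = 150          # attempts per second, plain Python (figure from the text)
OPTIMISED_RATE = 4_500   # attempts per second, John the Ripper (figure from the text)


def hours_to_search(keyspace, rate_per_second):
    """Hours needed to try every key in the keyspace at a fixed rate."""
    return keyspace / rate_per_second / 3600


def summarise(keyspace):
    """Return a two-line summary of search time at both rates."""
    lines = []
    for label, rate in (("At slow rate", SLOW_RATE), ("At optimised", OPTIMISED_RATE)):
        hours = hours_to_search(keyspace, rate)
        lines.append(f"{label}: {hours:,.2f} hours, or: {hours / 24:,.2f} days")
    return "\n".join(lines)


print(summarise(10 ** 13))
```

These are worst-case times for exhausting the whole keyspace; on average a match would turn up about halfway through.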

At this stage, it would not be worth the cost to brute-force this.

The information in the ID, or the document, is not valuable enough.

Scoping the “Available Keyspace”

The South African ID format is described here and in this OECD PDF.

The format is YYMMDDSSSSCAZ:

  • YYMMDD is the date of birth
  • SSSS separating people born on the same day
    • Female entries start 0-4, male entries start 5-9
  • C represents citizenship status
  • A was previously used to represent race, but is now unspecified
  • Z is a checksum digit, using the Luhn algorithm

How does this help us?

The Z check digit reduces the key space by a factor of 10: we “only” have to brute-force for the first 12 digits, and calculate the 13th.
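To make “calculate the 13th” concrete, here is a minimal sketch of a standard Luhn check-digit calculation over the first 12 digits. The function name is an assumption, and the payload in the usage line is the usual textbook Luhn example rather than anyone’s ID.

```python
def luhn_check_digit(payload):
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left; the digit next to the check digit is doubled,
    # then every second digit after that.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


print(luhn_check_digit("7992739871"))  # textbook example, check digit 3
```

Because the check digit is fully determined by the other twelve, it adds nothing to the keyspace.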

Since A is unspecified, we will leave its cardinality unchanged at 10.

The C citizenship status can be 0, 1, or 2. This digit now has cardinality of 3 instead of 10.


The YYMMDD digits form a date, which has constraints:

  • Years are from 00-99
  • Months are from 01-12
  • Days are from 01-31

If we just consider those digits individually, we can calculate the cardinality like this:

dates &= 10 \times 10 \times 2 \times 10 \times 4 \times 10 \\
&= 80,000

But that’s going to consider many impossible days: month 19 doesn’t exist, nor does day 35.

So we could just consider these 6 digits as a combined date field, and get a more useful answer:

dates &= years \times days \\
& = 100 \times 365 \\
&= 36,500

(Yes, I am ignoring leap-years in this calculation… they’re not material to the result)
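As a quick check on how little the leap years matter, this sketch counts real calendar dates with the standard library. Mapping YY to 1900-1999 is an assumption made purely for the example; the ID format itself does not record the century.

```python
from datetime import date


def count_valid_dates(first_year, last_year):
    """Number of calendar dates from 1 Jan first_year to 31 Dec last_year."""
    return (date(last_year, 12, 31) - date(first_year, 1, 1)).days + 1


estimate = 100 * 365                    # the figure used in the text
exact = count_valid_dates(1900, 1999)   # assumed century window
print(f"Estimate: {estimate:,}  Exact: {exact:,}  Difference: {exact - estimate}")
```

The difference is a couple of dozen leap days on top of 36,500, well under a tenth of a percent.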

Our new understanding of the ID number format comes together, and we can compare the Absolute with the Available keyspaces:

Keyspace_{Absolute} = &10^{13}\\
= &10,000,000,000,000 \\
Keyspace_{Available} = &valid\_dates \times serial\_numbers \\
& \times citizenship\_status \times A\_column \\
= &36,500 \times 10,000 \times 3 \times 10 \\
= &10,950,000,000

So even without any knowledge of the target, only 0.1095% of the original key space needs to be searched.

Python for Available keyspace

Number of valid ID numbers: 10,950,000,000
At slow rate: 20,277.78 hours, or: 844.91 days
At optimised: 675.93 hours, or: 28.164 days
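A sketch of the available-keyspace arithmetic, using the cardinalities worked out above (variable names are illustrative):

```python
valid_dates = 100 * 365     # YYMMDD as a combined field, ignoring leap days
serial_numbers = 10 ** 4    # SSSS
citizenship_status = 3      # C can be 0, 1 or 2
a_column = 10               # A is unspecified, so all ten values stay in play
# Z is calculated from the other twelve digits, so it contributes a factor of 1.

keyspace_absolute = 10 ** 13
keyspace_available = valid_dates * serial_numbers * citizenship_status * a_column

print(f"Number of valid ID numbers: {keyspace_available:,}")
print(f"Share of absolute keyspace: {keyspace_available / keyspace_absolute:.4%}")
```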

One month with optimised checking could be feasible, especially if you rented some machines… but can we do better?

What’s the Effective Keyspace?

A badly recreated email from a bank, with a banner "your electronic statement". The intro reads: Dear Mr G. Customer Please Find Attached your Statement for May • Information that only you will know is displayed in the eStatement verification block. This is done so you can be sure your statement is from BANKCO. • You will be required to enter the 13 digits of your identity number to view your statement. There is an additional box: Verification Info account number: *******5678 ID Number: *********1234
A recreation of the email sent by the bank

Revisiting the email, it contains some info I’ve ignored until now:

  • The last 4 digits of the ID as verification
  • The recipient is addressed as ‘Mr Customer’

Going back to the format YYMMDDSSSSCAZ.

We know the values for C & A, so those now have cardinality of 1.

  • C ‘only’ had a cardinality of 3, so that excludes 67% of possibilities
  • A had a cardinality of 10, so that excludes 90% of possibilities

These combine however, so the remaining amount from knowing C & A is: \frac{1}{3} \times \frac{1}{10} = \frac{1}{30}

Let’s reconsider the SSSS block: which we’ll refer to as S1, S2, S3, S4.

  • Since our ID is male we know that the S1 must be 5/6/7/8/9, so cardinality of that digit is now 5
  • We know S4, so it has cardinality of 1

Again these combine, so the total remaining is: \frac{5}{10} \times \frac{1}{10} = \frac{1}{20}

SSSS_{possible} &= 10^{4} = 10,000 \\
SSSS_{from\_email} &= 5 \times 10 \times 10 \times 1 \\
&= 500

Checking in the formula again:

Keyspace_{Absolute} = &10,000,000,000,000 \\
Keyspace_{Available} = &10,950,000,000 \\
Keyspace_{Effective} = &valid\_dates \times serial\_numbers \\
& \times citizenship\_status \times A\_column \\
= &36,500 \times 500 \times 1 \times 1 \\
= &18,250,000

We’re now down to 18.25 million possible keys to check.

Python for effective keyspace

Number of numbers matching email: 18,250,000
At slow rate: 33.80 hours, or: 1.41 days
At optimised: 1.13 hours, or: 0.047 days
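The same arithmetic as a sketch, with one cardinality per digit so the effect of each clue in the email is visible (names are illustrative):

```python
valid_dates = 100 * 365   # nothing in the email narrows the date of birth

s1 = 5    # addressed as 'Mr', so the first serial digit is 5-9
s2 = 10   # unknown
s3 = 10   # unknown
s4 = 1    # shown in the verification block
c = 1     # shown in the verification block
a = 1     # shown in the verification block

serial_numbers = s1 * s2 * s3 * s4
keyspace_effective = valid_dates * serial_numbers * c * a

print(f"Number of numbers matching email: {keyspace_effective:,}")
print(f"Hours at 150/s: {keyspace_effective / 150 / 3600:,.2f}")
```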

Even the naive 1.41 days is starting to look feasible, and with John the Ripper we’re already doing it in a little over an hour.

But what about the check digit?

Earlier we ignored the check digit, since we can calculate it… but we were supplied it in the email.

We can use it to see if the ID is a potential match, and only check matching ones against the file.

Luhn format checkdigits use simple modulo 10 arithmetic.

This means only 10% of the generated IDs will be checked against the PDF password.

Python for checksum validated keyspace

Entries matching email check digit: 1,825,000
At slow rate: 3.38 hours, or: 0.14 days
At optimised: 0.11 hours, or: 0.005 days
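The 10% figure can be checked on a small slice of the keyspace. In this sketch the date and first serial digit are arbitrary, the trailing digits are the placeholder “1234” from the recreated email, and the helper name is an assumption.

```python
from itertools import product


def luhn_check_digit(payload):
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:          # every second digit, starting at the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


prefix = "8001015"      # arbitrary YYMMDD plus S1, for illustration only
known_suffix = "123"    # S4, C and A from the placeholder verification block
target_check = 4        # Z from the placeholder verification block

candidates = 0
matches = 0
for s2, s3 in product("0123456789", repeat=2):
    candidates += 1
    if luhn_check_digit(prefix + s2 + s3 + known_suffix) == target_check:
        matches += 1

print(f"{matches} of {candidates} candidates carry the right check digit")
```

Changing any single digit cycles the Luhn check digit through all ten values, so exactly one candidate in ten survives this filter before the PDF is ever touched.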

Our ‘effective’ keyspace is now 1,825,000 entries.

So even with a naive implementation just in Python, we can do it in less than a day.

Age scoping

A friend pointed out that searching all ages between 0-100 is a bit pointless, so we could change that to be a range of 18-70?

Because the birthday field covers 100 years cleanly, we calculate the number of years we want to test.

keyspace_{Age Scoped} = keyspace_{effective} \times \frac{years\_to\_test}{100}

However, given how small the effective keyspace already is, the age reduction is less useful in this scenario. If you had less information to start with, it could matter more.

Python for age reduced keyspace

Number in valid age range: 949,000.0
At slow rate: 1.76 hours, or: 0.07 days
At optimised: 0.06 hours, or: 0.002 days

In summary

From an absolute keyspace of 10,000,000,000,000, we’ve excluded over 99.99999% of the possible numbers, and have only 949,000 to check against the file.

Python to generate table

| Set of potential ID Numbers | Size of set | Percentage of Absolute | Hours @ 150/s | Hours @ 4,500/s |
|---|---|---|---|---|
| Absolute Keyspace | 10,000,000,000,000 | 100.000000% | 18,518,518.5 | 617,284.0 |
| Available Keyspace | 10,950,000,000 | 0.109500% | 20,277.8 | 675.9 |
| Email Keyspace | 18,250,000 | 0.000182% | 33.8 | 1.1 |
| Using Email Checkdigit | 1,825,000 | 0.000018% | 3.4 | 0.1 |
| Limiting by Age | 949,000 | 0.000009% | 1.8 | 0.1 |
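A sketch of how the rows of such a table could be derived in one place, including the 18-70 age window. The layout and names are assumptions, not the original notebook code.

```python
RATES = (150, 4_500)   # attempts per second, from the text
ABSOLUTE = 10 ** 13
DATES = 100 * 365

rows = [
    ("Absolute Keyspace", ABSOLUTE),
    ("Available Keyspace", DATES * 10_000 * 3 * 10),
    ("Email Keyspace", DATES * 500),
    ("Using Email Checkdigit", DATES * 500 // 10),
    # Ages 18-70 cover 52 of the 100 years in the YY field.
    ("Limiting by Age", DATES * 500 // 10 * (70 - 18) // 100),
]


def format_row(name, size):
    """One table row: name, size, share of absolute, hours at each rate."""
    share = size / ABSOLUTE
    hours = "".join(f"{size / rate / 3600:>16,.1f}" for rate in RATES)
    return f"{name:<24}{size:>20,}{share:>14.6%}{hours}"


for name, size in rows:
    print(format_row(name, size))
```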


Graphing this is really hard, differences in scale make it really difficult to communicate.

Since I’m not a data-vis genius, this uses a log-scale.

Python to generate Graph


This analysis shows that, in common with recent attacks on the Tetra encryption system, if you’re not using all of the absolute keyspace, your protection is far weaker than may appear from a big number.

These national/structured IDs do not make good secrets: the structure inherently reduces the size of the effective keyspace, and makes it very easy to exclude ranges of people (by age or gender).

While phishing is a problem, and emails need to be/appear authentic – we need to use mechanisms to achieve this at the email level: SPF, DKIM, DMARC, BIMI. While imperfect, these are far better than including information directly related to the ID/information being protected.

In this scenario, even with a naive implementation, it would be entirely feasible to brute-force this particular email/pdf combination, which would expose customer information.

Now I don’t know how valuable the information in the statement is, but I wonder if it could be used as part of a social engineering attack?

A plea to companies: If I message asking you to stop sending me statements, maybe stop?

Save Money, deploy IPv6 in your VPC

Getting past uncertainty or unwillingness to use IPv6 is actively costing you money.

I (mostly) have IPv6 deployed in my home network, alas a bug on my router currently prevents it working completely, but for years it’s been enabled, and mostly as a novelty to feel like I was ready for the grand future – but without really noticing any differences.

Last week however, I had a chat that made me realise that using IPv6 is now a cost-saving measure… and it’s maybe time to get over our resistance to do it by default.

The Joy of NAT

Typically you deploy an AWS VPC using internal IPv4 addresses, and (if you’re as allergic to avoidable self-managed service as I am), a managed NAT gateway.

As with nearly everything AWS, there’s an hourly cost, and also charges for the volume of data transferred.

In an AWS group I’m a member of, someone asked “We’re paying a lot for NAT, what can we do” – and I said “I mean, you could try the IPv6 Egress Gateway, but I dunno if the APIs you’re using support IPv6”.

The Egress only IPv6 gateway only charges ‘standard’ egress rates and it has no rental cost. I’ve been aware of its existence for years, but have never deployed it or had reason to.

I was expecting to be told “Only one of my APIs supports IPv6” but the person reported back “Actually nearly all my APIs are on IPv6, but I can’t deploy the IPv6 easily because of <reasons>”.

This was not what I was expecting, despite remembering that for the last few years I have been deploying dual-stack endpoints wherever I can…

So why haven’t we

There are a few reasons why we haven’t been deploying IPv6 more routinely, and we should try to change these:

  1. IPv6 didn’t have any advantages. In most cases the IPv4 setup works ok… So what’s the point of adding it to a working setup…
  2. The deployment tooling doesn’t support it – this was the case here, the person was using the lovely AWS CDK to deploy their VPC, and the current VPC ‘construct’ doesn’t support IPv6 easily
  3. We like NAT’s Security by default: With NAT, your compute resources aren’t exposed to the internet, all that endpoints see are 1 or 2 shared IPs… There’s no way to accidentally open inbound connections, and that security by default is pleasing. Even though the IPv6 Egress Only gateway doesn’t allow inbound connections, you do feel more exposed, and so much of security is an annoying vibes type thing

Some steps to take

It’s sometimes easy to forget that we have run out of IPv4 addresses, and that we’re muddling through with Carrier Grade NAT and other annoying technologies that work, but make everyone’s lives just that little bit worse… But we can improve this incrementally, and anyone who’s worked with me knows I love making things better gradually.

  1. Enable IPv6 on any managed services you use by default. If you’re running CDN hosts on CloudFront, or an API on API Gateway, those can all run IPv6, and are running it outside of your VPC. There is no security risk to you, and by creating those, you give other people benefit
  2. Enable IPv6 in your VPC, but just on your public subnets. If adding IPv6 makes you nervous, start with ‘just’ doing it on your public subnets, and use it to give your load-balancers IPv6 addresses. This makes life better for other people, and gives learning for the next step
  3. Enable IPv6 with Egress only gateway, for any subnets that have NAT: this is where you start saving money, as hopefully you’re NAT’ing less traffic and saving on those charges
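For step 3, a minimal sketch of what the API calls might look like with boto3. It assumes the VPC and its private subnets already have IPv6 CIDR blocks, and the function name and IDs are hypothetical. The client is passed in, so nothing here touches AWS until you call it with a real one.

```python
def add_ipv6_egress(ec2, vpc_id, route_table_ids):
    """Create an egress-only internet gateway and send IPv6 traffic through it.

    `ec2` is expected to behave like boto3.client("ec2").
    """
    response = ec2.create_egress_only_internet_gateway(VpcId=vpc_id)
    gateway_id = response["EgressOnlyInternetGateway"]["EgressOnlyInternetGatewayId"]

    # Add a default IPv6 route to each private route table.
    for route_table_id in route_table_ids:
        ec2.create_route(
            RouteTableId=route_table_id,
            DestinationIpv6CidrBlock="::/0",
            EgressOnlyInternetGatewayId=gateway_id,
        )
    return gateway_id


# Usage (hypothetical IDs, requires boto3 and AWS credentials):
#   import boto3
#   add_ipv6_egress(boto3.client("ec2"), "vpc-0123example", ["rtb-0123example"])
```

The existing IPv4 route to the NAT gateway stays in place, so anything that can’t speak IPv6 keeps working while the IPv6-capable traffic stops incurring NAT charges.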

Not every use-case will save money, I haven’t generally had the patterns of bandwidth usage that would benefit from this… but if you’re using a lot of NAT bandwidth, maybe it’s time to look at IPv6 as a potential cost-saving.

About “That” Prime Engineering article

Everyone who has worked with managed cloud services has experienced the moment when it made sense to move away from managed services.

Turns out, so does Prime Video.

Amazon Prime Video recently wrote about how changing away from managed services and writing a more integrated application saved them money. Despite being a few months old, this appeared to blow-up this week, and predictably has caused some cries of “SEE, SEE YOU SHOULD JUST RUN EVERYTHING YOURSELF”.

But to those of us who have been building on AWS (and other providers) for many years, it’s not a surprise, and we all have stories where we’ve done similar.

I say this as someone who is an annoying cheerleader for serverless compute and managed services, but despite that, I have home-rolled things, when it made sense.

How do you solve a problem

When you’re solving a problem, you look at the managed services that you have available, considering factors like:

  • Your team’s experience with the service
  • Limitations on the service, and what it was intended for, against what you’re doing
  • Which quotas apply, and which you might hit
  • How the pricing model works

While pricing for managed-services is generally based on usage, sometimes specific costs will apply more to your workload, e.g. if you’re serving small images, you’ll be more impacted by the per-request cost than the bandwidth charges.
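A rough sketch of that effect. The prices below are made up purely for illustration and are not any provider’s real rates; the function name is also an assumption.

```python
def monthly_cost(requests, avg_object_bytes, price_per_million_requests, price_per_gb):
    """Split a month's bill into (request cost, bandwidth cost)."""
    request_cost = requests / 1_000_000 * price_per_million_requests
    bandwidth_cost = requests * avg_object_bytes / 1e9 * price_per_gb
    return request_cost, bandwidth_cost


# Hypothetical prices: substitute the real ones from your provider's price list.
PRICE_PER_MILLION_REQUESTS = 1.00
PRICE_PER_GB = 0.10

# Lots of 5 KB thumbnails versus a few 50 MB downloads.
small_images = monthly_cost(100_000_000, 5_000, PRICE_PER_MILLION_REQUESTS, PRICE_PER_GB)
large_files = monthly_cost(100_000, 50_000_000, PRICE_PER_MILLION_REQUESTS, PRICE_PER_GB)

for label, (req, bw) in (("small images", small_images), ("large files", large_files)):
    print(f"{label}: requests {req:,.2f}, bandwidth {bw:,.2f}")
```

With these made-up numbers the small-image workload is dominated by the per-request charge and the large-file workload by bandwidth, which is the asymmetry to look for in a real price list.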

I would be surprised if an experienced architect hasn’t faced a situation where “Service X would be perfect for this, if only it didn’t have this restriction Y, or wasn’t priced for Z”.

My example

We’d built out a system that was performing 3 distinct processing steps on large files.

The system had built out incrementally, and we had the 3 steps on three different auto-scale groups, fed by queues.

While some of the requests could be processed from S3 as a stream, one task required downloading the file to a filesystem, and that download took time.

The users wanted to reduce the end-to-end processing time. Some of the tasks were predicated on passing prior steps, and so we didn’t want to make the steps parallel.

Attempt 1: “EFS looks good”

We used the ‘relatively’ new Elastic File System service from AWS… The first component downloaded the file, subsequent operations used this download.

This also had the advantage that since the ‘smallest’ service was first, you paid for that download on the cheapest instance, and the more expensive instances didn’t have to download it.

We developed, deployed, and for the first morning it was really quick… until we discovered that we were using burst quota, and spent the afternoon rolling back.

Filesystem throughput was allocated based on the amount stored on the filesystem, but as this was a transient process, we didn’t replenish it quickly enough, and didn’t like the idea of just making large random files to earn it.

Now you can just pay for provisioned throughput, perhaps in a small part because of a conversation we had with the account managers.

Attempt 2: “Sod it, just combine them”

The processes varied in complexity, there was effectively a trivial, a medium, and a high complexity task… So the second solution we approached was combining all the tasks onto a single service… the computing power for the highest task would zoom through the other two tasks, and so we combined them into what I jokingly called “the microlith”.

We didn’t touch the other parts of the pipeline, or the database, they remained in other services, but combining the 3 steps worked.

What did we gain

The system was faster, and even more usefully to operators, more predictable.

Once processing had started you could tell, based on the file size, when the items would be ready…

Much like “lower p90 but higher maximum” feels better for user experience, this consistency was great.

What did we lose

Two of the three components had external dependencies, and this did mean this component was one of the less ‘safe’ to deploy, and while tests built up to defend against that… the impact of a failed deploy was larger than you’d want.

In Conclusion

There are always times when breaking your patterns makes sense, the key is knowing what you’re both gaining and losing, and taking decisions for the right reasons at the right times.

Prime Video refining an architecture to better meet scaling and cost models, making it less “Pure”, isn’t the gotcha against these services that some people would have you believe.

“Pure” design doesn’t win prizes.

Suitable design does.

Can I port my number out of Skype UK?

Here’s the one wild secret they don’t want you to know.

Please forgive me the horrible SEO title and URL…

I recently ported a number out of Skype UK to a much cheaper SIP VOIP provider. Skype served me well for a number I really just needed for compliance reasons… and it seemed obvious that I should actually transfer it to a supplier that would cost me about 12GBP per year, rather than 57 EUR as I was paying.

Number porting in the UK is a magical mystery tour, for all the times it “just works” other times you’ll feel like you’re playing the worst possible game of DND, with a horrible enemy and a disinterested Game Master.

Skype initially said “you don’t need to do anything special, just ask the new provider to port in” but turns out there was this one secret trick they didn’t think I’d want to know…

Skype expect your ‘surname’ as provided by your new supplier, to actually be your Skype username.

I discovered this after two failed porting attempts (for which my new super cheap provider charges me, which I understand, since the ongoing rental is so cheap).

I had asked Skype multiple times what to do: before the first port, after the first failure, after the second failure.

Most of these interactions were pretty infuriating: “Porting is done by your new provider… it’s nothing to do with us” – despite the fact that my new provider asks Skype to port, and Skype then says yes/no… They are pretty involved in the situation.

Skype wouldn’t tell the new provider what was wrong, beyond “the surname didn’t match” – This is entirely correct, because since Skype UK don’t seem to implement any porting code/verification scheme, for security they can’t and shouldn’t tell the port destination…

However they wouldn’t then tell me, via an authenticated channel, what they were expecting. Merely “you did it wrong try again… porting is nothing to do with us.”

It was only when I then went on another chat, talking about going to the ombudsman, and generally being a pain, that the porting team authorised the port, and revealed to the provider “yeah, the surname needs to be the username”.

This is not mentioned in their documentation. Or provided in live-chat because it’s seemingly a rare occurrence.

So if you’ve come to this post, having had problems porting out, try again telling your new provider that your ‘surname’ is in fact your username.

2023 Comms Resolutions

What things can you do in 2023 to make your communications more efficient and considerate in the world?

I don’t really like New Years resolutions for reasons beyond the scope of this post.

This year however, I am going to try and make a few changes to how I communicate, in work and otherwise.

“No Hello”

No Hello on instant messaging.

I hate being on the end of the Dangling Hello, and the 15 minutes of massively predicting what the person is wanting. But I still find it very hard to bundle it all up in the first message.

Equally, 4 notifications in quick succession can feel like literal torture.

You can still ask how people are doing, but you can just include that upfront, in a single message.

Hello X, hope you’re good, can you tell me what’s going on with TICKET-123

Me, Slack, This Year

Priority Tagging, ideally lower

Low Priority exists as well as High Priority on emails.

Flagging a gossipy/catchup IM as such in the opening.

Clarify & Summarise

The discomfort at being That Guy who pastes back the summary of what you agreed is less than the pain when you discover that you weren’t all sharing understanding.

When half the team thought “advance by 2 seconds” meant delete 2 seconds, and the other half thought add 2 seconds..

Always a default

When arranging things, I’m going to offer a default, always.

“I’m free all day” vs “I’m free all day, how about 11”

Make it easier to say “Great” done.

Stick to Core Hours

I’m a freelancer, I work self-defined hours… but that’s not mine to share with others.

While it’s useful for me to get thoughts out of my head into an email, that doesn’t mean I need to get them forced onto other people…

  • If I’m sending an email, I’ll set it to send later
  • If it’s an IM, I’ll set Slackbot to remind me or maybe the person, during the next working day

In Conclusion

We’re all drowning in a sea of notifications. If you can make yours just a little better, you make it easier for people around you to help you.

Smartspeaker listening is massively up, but sure, delaying podcasts will help drive adoption

The BBC really need to add more Siri intents to Sounds to enable smart speaker listening.

Rajar data continues to show that Smart speaker listening is on the up.

As someone thoroughly locked into the Apple ecosystem, until recently I’ve not been able to easily ask for BBC Radio services through my HomePods. I had to follow some Reddit posts to install shortcuts for “hey Siri play BBC 6 Music”, and suddenly I find myself listening to more BBC Radio as a consequence.

Previously I think you had to listen to stations ‘enough’ for Siri to recognise the activity, then it could be a suggested shortcut. It was a bit ugly, and down to how intents originally worked.

The Siri APIs have got more developed over time, and now it is possible to do this in a way that doesn’t require upfront declaration.

It’s funny then, that given the choice of “adding a play with Siri intent to BBC Sounds” or “delaying podcast release to open platforms”, that Auntie chose the latter…

This has the strange result:

  • The BBC’s first party actions have me listening to less BBC Podcast Audio – why listen to a topical podcast 4 weeks later, and having to remember to check BBC Sounds doesn’t match my workflow
  • Actions by a Third-Party, have me listening to more BBC Live audio

I know there are always backlogs, but Siri intents have been around for a while now…

You make me install it… you tell me how to remove it

Having installed some invasive ‘online proctoring’ software, I tried to ask the company how to uninstall it.

I was, finally, getting around to doing AWS certifications, and one of the ways you can do that is via the Pearson OnVue proctoring system. You run software that limits what you can use your machine for, and stops you using multiple windows to look up answers. It asks for a fair bit of access when you run it, unsurprisingly.

I installed it, I ran the test to see if it worked, and shortly afterwards was having problems with my computer and the clipboard – and wondered if that was a ‘feature’ of this software.

I have since resolved these separately, but I still wanted to uninstall OnVue, and there is NO detail of how to do this on the website.

I have asked all their contact channels, “how do I uninstall it” – I get back a variety of canned responses:

  • “you can’t cancel an exam via email”
  • “if you run a test exam you can see if the software works on your system”

Now given it doesn’t look like it was an “installed” app and just an application that ran from a zip file, uninstalling it could be as simple as not running it ever again, and deleting the download.

But I don’t know if it hasn’t installed a few system extensions or similar. I’ve had a quick look and I don’t think it has, but is it too much to expect a webpage owned by Pearson that says that? Right now the search results for “remove onvue macintosh” are swamped by advertising pages for Mac removal software.

A document that states what to do online, an answer in the knowledge base so agents respond – is really the bare minimum.

If you make me run/install it, give me a clear way to remove it.

The Weaponisation of Resilience

(This post inspired by some posts I saw on LinkedIn, and some client experiences from many many years ago)

Resilience is a useful property.

We want it in many places: In our infrastructure from floods or traffic spikes, in our organisations from attacks by bad faith actors, and within ourselves from unexpected or exceptional events in our lives.

Now, in the last few years *waves hand* we’ve had quite a bit going on, and many of us have had to call on that resilience reserve more than usual.

Maybe as a result of that, or a general awareness of mental health in the workplace, it’s now something that’s being taught to employees.

While I think that is a good thing, I think that has the potential to be weaponised.

A resilient worker, or more likely a team, has the skills/headroom/reserve to cope with a “once in an N” event, every “N”. So a “once in a week” exception every week, a once in a month exception every month, etc.

I worry that bad managers and teams, will weaponise that resilience, and expect teams to be resilient against all events, even if they’re facing a ‘once a year’ event every month.

Exceptions aren’t avoidable. Things do change, go wrong, go better, go worse.

You can’t avoid all exceptions, and those are the ones that will draw on the “resilience reserve”, but if your team is constantly facing exceptions that are caused by poor coordination or planning – you’re wasting that valuable resource you should be keeping for elsewhere.

Giving your team the tools to be resilient is great, but you’re not giving them invincibility.

Is your software more important than you realise?

Software that isn’t “safety critical” can have real-world impacts.

If you’ve been working in IT for as long as I have been, you’ll maybe remember this wonderful example of legalese:


Windows NT4 License agreement

It’s a pretty good example of where our minds tend to go when you think of “safety critical” systems. I tend to also think about things like complex train automation systems or the Therac-25 radiation therapy machine.

All things that are complicated, but are generally grounded in physical interactions with machinery, machinery that has high energy, or that interacts with other humans.

This came to mind because once again the Post Office Horizon Scandal, one of the biggest miscarriages of justice ever in the UK, is in the news.

If you weren’t aware, the system was buggy and could cause a branch to have massive shortfalls, giving postmasters three options:

  • Make up the loss themselves, and hope the problem didn’t happen again
  • Report that their accounts balanced, which was an act of fraud
  • Try to report to the Post Office, which would be unhelpful at best, or begin an investigation at worst

The results of this were bleak:

  • People were wrongly convicted of fraud/stealing from the Post Office
  • People were wrongly imprisoned
  • Some people ended their lives in the immense shame of being someone who stole from their local community

In hindsight, that looks pretty safety critical… lives were materially changed, damaged, or extinguished.

What’s worse is that people from the software vendor, and the post office claimed that the system was robust, that remote access wasn’t possible – at the same time as planning remote access to resolve issues caused by known bugs.

The latest BBC Radio 4 program on this (after an amazing series), had an instance where a Post Master lost his branch due to these bugs, a new owner bought the shop, only to then experience the same bugs. The helpline gave the same line “Nobody else is reporting these problems” which sounds highly unlikely to be true.

Sure some senior people at the time have stepped down from their non-exec directorships.

In my view this is negligence, as they should have done the due diligence to ascertain that the system was generally robust.

Software is everywhere.

Ovens have Wifi, cars have highly complex computer vision, human bodies have attachments controlling insulin flows. People had artificial eye implants to help them see, that the manufacturer no longer supports.

Whistleblowing is a painful and sacrificial act for the person who does it. But if you see people from your company testifying in courts of law that “there are no problems with the software” (an impossible situation in all but the smallest of programs), we need to provide better ways to help this information surface.

Maybe if defence teams were better briefed, a statement like that could be countered with “No problems? Cool, we’ll verify that with an extract from your Jira instance” or “a third-party code review wouldn’t be a problem”?

I don’t know the solution.

I’m not a lawyer, I’m not an ethicist, I’m not someone who typically works in these kinds of environments – but I do know that lives were lost due to an accounting system being buggy.

And that doesn’t sit right with me.