In praise of Advent Of Code

Advent of Code is a great way to brush up/crosstrain your coding skills

I’ve written previously about how I learn, and this year I’d like to add Advent of Code to that regime.

If you’ve not done it before, AoC (which we’ll agree is confusingly overloaded with the America Congresswoman), presents you with 24 problems, each with a simple part, and a (usually) more complex extension.

In prior years I’ve used Python, this year I decided to write them in Swift. While not the most obvious language for these challenged (Python feels more appropriate), but this year I wrote a SwiftUI iOS app. Alas the client didn’t go for it in the end, but while doing that I was doing a lot of “change it and try” style edits – which I realise now was because I didn’t know a bunch of the more basic swift features/idioms… You know, those kind of things that are really easy to skip over when jumping from one language to another, because a “beginners” tutorial is too easy.

Anyway, over the course of the exercises I’ve learned a few things about swift, I’ve remembered a few things about software engineering, and my code is becoming better… which is the point.

Finally I am remembering that I’m doing this for “fun”. I’m learning, learning is fun for me – but also I don’t have to get things to work. If I can’t get to the second problem phase, that’s good. Hell a few of the problems I just didn’t get that day, and didn’t even get the first part.

And that’s ok, because while if was being paid, I would continue to hit my head until I got the code to work… this is not a thing I have to do, it’s a thing I’m choosing to do.

So if you’re looking to stretch those particular muscles again, I would recommend.

“What will the performance be like?”

Despite the documentation, blog posts and experiences, you’re just going to have to test it…

AWS offers a dazzling array of services for similar things. Dazzling is a great synonym for “confusing”. Feels like “Speed of Innovation”, “Range of Services” and “Coherence of Offerings” is a “pick 2 of 3” situation…

An often asked question on the AWS communities I hang out on is “can we use a Lambda for that?”

People asking this are often primed with some of the problems of serverless compute, and jump to the advanced features to mitigate them:

“I’m worried about cold starts, do I need to pay for provisioned concurrency?”
“I’m accessing a database, do I need to add an RDS Proxy?”
“I worry about response time, do I need to enable API Gateway caching?”
“I worry about out-of-order”, do I need First-In First-Out queues¹

These are valid questions, with valid solutions, but paraphrasing a global sportswear brand: Just Try It.

Deploy your code, and see how it performs/responds. Then apply the tricks to make it better.

Starting with the dirtiest hack of all: Make it bigger.

Lambda CPU is only controllable by memory, and I’ve been in multiple situations where there was a step-change in performance jumping from 64 to 128 megabytes. There’s tooling that can help you find the right size for your application.

It’s lazy, brute force, but takes literally seconds to change the memory and re-test.

I’m a big proponent of ‘innovation as Business as Usual’: you shouldn’t need the excuse of an innovation day to try the new things. That only works if you start simple.

But I’m an even bigger proponent of not stressing yourself/teams unnecessarily: If you don’t have time to test in your current project, do what you know works. Hit your shipping dates, have an easier sprint.

Hopefully you can have a play in the future and add something else list of what “works”, it’s nice to expand that list, but only when you’ve enough headroom to do so.

In conclusion, while sometimes I can tell you that something is unlikely to work, most of the time you’re just going to have to try it.

Sorry.

I understand FIFO messaging has its place, but I work in content, not transactions. Including a ‘reliable’ timestamp in the event (e.g. the time an article was published, not the time the message was sent) is cheaper and allows consumers to identify if they should process an event, allowing easier recovery if databases have to be re-crawled. ↩︎

Systems are designed in a context

I’ve been following a bit of Internet Of Things drama because a company Cease & Desisted a developer who was polling their API for an unofficial Home Assistant Integration.

Thankfully, it looks like the company is now engaging with the Developer and The Home Assistant team, so maybe it can be resolved.

But one of the recurring comments initially was “if they’re saying it costs too much money, they should use a cheaper host than AWS”

Disclosure, my career is helping companies use the right bits of AWS well, and for all my discomfort at the centralisation of the internet into the Hyperscalers – the ability to deploy a website that costs nothing if nobody uses it, yet scales to match demand, is pretty compelling.

“What Context do you mean Gareth?”

Backing up a second, this integration polls the endpoint pretty aggressively, and so could be causing a noticeable spike to API calls, and to the costs experienced by Haier.

If there were about 500 active installs, that could cost something between 500 and 1,500 USD per month, based on the polling rates, probably immaterial in the grand scheme of things, but noticeable.

The context I’m talking about, is that they built this API knowing it wasn’t going to be called that much… many of the interactions customers saw would be driven by push notifications or other event driven things. It’s only called by the app/website when people visit… it was never costed for being called continuously.

They didn’t optimise that cost, because it didn’t make sense to.

But still, why aren’t they doing it cheaper?

Bluntly because “total cost of operation” (TCO) is more than just the compute.

Sure I could just “spin up a VM and do it all myself” but then I’m doing it all myself. I’m suddenly in the realm of patching. I’ve got to do my own auth.

If AWS are terminating your HTTPS endpoint, then AWS are on the hook for the latest HTTP desynchronisation exploit.

When you host stuff on managed services (of any provider) you’re really going hands off, you don’t need an ops team because ultimately, there isn’t generally anything that you can operate… If there’s an outage, often times, you just have to wait for the provider to fix it.

That might feel disempowering, but also, there’s no point having an on-call team and waking someone at 3am to go “Yup, it’s fucked” but be unable to actually fix it.

In summary

There’s usually a cheaper way to deploy things.

The question is “is it actually cheaper, everything else considered?”

Don’t try to second guess other peoples architecture, and if you are going to poll an API, try to do it considerately…

Complete means complete…

Quick rant here, but I’m in the process of leaving a service that requires some careful presentation of the data/documents for regulatory type purposes.

Fine, I think, they tell me to head to the download page, and download my data.

Use this to get a complete export of all your XXXX data or restrict by date. The export includes all accompanying receipt attachments and PDF documents where relevant.

If you need to restrict the results further you can do this by using the filters on the relevant list and running the export from there.

That sounds pretty definitive, great, it’ll be a big zip file with everything.

“Complete export” is even in bold, that shows intent.

The zip file expands into a folder with the word “complete” In it.

BUT as you’ve now doubt guessed from all the signposting, it is not however complete. There are a bunch of things not included.

It’s not stated what things aren’t included, because you know, that would be too easy.

TL;DR, If you use the world ‘complete’ in your data extraction, include everything

At the very least tell me what’s missing.

Good to have it confirmed why I’m leaving this service.

State-Issued Identifiers aren’t generally good passwords

Just because a number is long, doesn’t mean that it’s secure…

For many years, despite repeated requests, a South African bank has been sending me bank statements.

The thing is that I don’t live in South Africa, have never visited, nor banked there. But I do have a particularly “good” email address that gets a lot of misdirected emails… I usually don’t read them, but this week I did, as I replied for the latest attempt of “please stop”.

The email included a PDF statement, password protected with the South African ID number of the customer. I suspect that protection is why the service centre seem so unbothered by the repeated requests to stop this information.

A few years back, I got very embedded in PII/GDPR. We were designing a data warehouse setup that allowed analysis of user activity, while protecting privacy, and enabling easier compliance with GDPR deletion requests. There was discussion about the feasibility of SHA2 hash reversal… and we took a lot of time to communicate the infeasibility with the legal team.

So this week, I started to wonder: If I were a bad actor (which I am not), could I feasibly crack this ID with some Python?

What’s the challenge?

There are three sets to consider in this:

The Absolute Keyspace: without any knowledge of the identifiers, the number of combinations?
The Available Keyspace: if the ID has constraints, how many valid combinations are there?
The Effective Keyspace: if we know anything more, how many combinations are applicable?

A good security identifier should make discerning the differences between these difficult: it should involve a reasonable amount of calculation before it becomes obvious that an ID is valid, and correct.

What do we know?

South African ID Numbers are 13 decimal digits.

A single decimal can be 0-9, ten in total, and so each digit has a a cardinality of 10.

(Cardinality being the maths word for ‘number of choices’ in a set)

We can use this information to calculate the absolute keyspace:

\begin{align*} Keyspace_{Absolute} = &10 \times 10 \times 10 \times 10 \times 10 \\ & \times 10 \times 10 \times 10 \times 10 \\ & \times 10 \times 10 \times 10 \times 10 \\ = & 10^{13}\\ = & 10,000,000,000,000 \end{align*}

Since this work was derived in a Jupyter notebook, I’ll also include some python as we go along…

Python Code for absolute keyspace

number_of_digits = 13
keyspace_absolute = 10 ** number_of_digits
print(f"Number of combinations: {keyspace_absolute:,}")

Number of combinations: 10,000,000,000,000

Is this feasible?

So without knowing anything about the ID, there are 10 trillion combinations to check.

Python can attempt to open a PDF with a password around 150 times per second. This would be our basic implementation.

More specialised tools like John the Ripper raise that rate 4,500 per second, that’s around 30 times faster.

We’ll put these into a summary function, as we’ll be calling this a few times.

Python for Summary Function

test_rate_slow = 150
test_rate_fast = 4500
def summary_to_brute_force(combinations):
    hours_to_test_slow = (combinations / test_rate_slow) / 60 / 60
    hours_to_test_fast = (combinations / test_rate_fast) / 60 / 60
    return f"At slow rate: {hours_to_test_slow:,.2f} hours, or: {hours_to_test_slow / 24:,.2f} days" + \
           f"\nAt optimised: {hours_to_test_fast:,.2f} hours, or: {hours_to_test_fast / 24:,.3f} days"

print(summary_to_brute_force(keyspace_absolute))
At slow rate: 18,518,518.52 hours, or: 771,604.94 days
At optimised: 617,283.95 hours, or: 25,720.165 days

At this stage, it would not be worth the cost to brute-force this.

The information in the ID, or the document, is not valuable enough.

Scoping the “Available Keyspace”

The South African ID format is described here and this OECD PDF.

The format is YYMMDDSSSSCAZ:

YYMMDD is the date of birth
SSSS separating people born on the same day
- Female entries start 0-4, males entries start 5-9
C represents citizenship status
A was previously used to represent race, but is now unspecified
Z is a checksum digit, using the Luhn algorithm

How does this help us?

The Z check digit reduces the key space by a factor of 10: we “only” have to brute-force for the first 12 digits, and calculate the 13th.

Since A is unspecified, we will leave its cardinality unchanged at 10.

The C citizenship status can be 0, 1, or 2. This digit now has cardinality of 3 instead of 10.

Dates

The YYMMDD digits are dates, these have constraints:

Years are from 00-99
Months are from 01-12
Days are from 01-31

If we just consider those digits individually, we can calculate the cardinality like this:

\begin{align*} dates &= 10 \times 10 \times 2 \times 10 \times 4 \times 10 \\ &= 80,000 \end{align*}

But that’s going to consider many impossible days: month 19 doesn’t exist, nor does day 35.

So we could just consider these 6 digits as a combined date field, and get a more useful answer:

\begin{align*} dates &= years \times days \\ & = 100 \times 365 \\ &= 36,500 \end{align*}

(Yes, I am ignoring leap-years in this calculation… they’re not material to this calculation)

Our new understanding of the ID number format comes together, and we can compare the Absolute with the Available keyspaces:

\begin{align*} Keyspace_{Absolute} = &10^{13}\\ = &10,000,000,000,000 \\ Keyspace_{Available} = &valid\_dates \times serial\_numbers \\ & \times citizenship\_status \times A\_column \\ = &36,500 \times 10,000 \times 3 \times 10 \\ = &10,950,000,000 \end{align*}

So even without any knowledge of the target, only 0.1095% of the original key space needs to be searched.

Python for Available keyspace

cardinality_of_birthdays = 100 * 365 # we're ignoring leap years
cardinality_of_serial_numbers = 10000
cardinality_of_citizenship_states = 3
cardinality_of_the_a_digit = 10
keyspace_available = cardinality_of_birthdays * cardinality_of_serial_numbers * cardinality_of_citizenship_states * cardinality_of_the_a_digit
print(f"Number of valid ID numbers: {keyspace_available:,}")
print(summary_to_brute_force(keyspace_available))

Number of valid ID numbers: 10,950,000,000
At slow rate: 20,277.78 hours, or: 844.91 days
At optimised: 675.93 hours, or: 28.164 days

One month with optimised checking could be feasible, especially if you rented some machines… but can we do better?

What’s the Effective Keyspace?

A badly recreated email from a bank, with a banner "your electronic statement". The intro reads: Dear Mr G. Customer Please Find Attached your Statement for May • Information that only you will know is displayed in the eStatement verification block. This is done so you can be sure your statement is from BANKCO. • You will be required to enter the 13 digits of your identity number to view your statement. There is an additional box: Verification Info account number: *******5678 ID Number: *********1234 — A recreation of the email sent by the bank

Revisiting the email, it contains some info I’ve ignored until now:

The last 4 digits of the ID as verification
The recipient is addressed as ‘Mr Customer’

Going back to the format YYMMDDSSSSCAZ.

We know the values for C & A, so those now have cardinality of 1.

C ‘only’ had a cardinality of 3, so that’s excludes 67% of possibilities
A had a cardinality of 10, so that excludes of 90% of possibilities

These combine however, so the remaining amount from knowing C & A is: $\frac{1}{3} \times \frac{1}{10} = \frac{1}{30}$

Let’s reconsider the SSSS block: which we’ll refer to as S1, S2, S3, S4.

Since our ID is male we know that the S1 must be 5/6/7/8/9, so cardinality of that digit is now 5
We know S4, so it has cardinality of 1

Again these combine, so the total remaining is: $\frac{5}{10} \times \frac{1}{10} = \frac{1}{20}$

\begin{align*} SSSS_{possible} &= 10^{4} = 10,000 \\ SSSS_{from\_email} &= 5 \times 10 \times 10 \times 1 \\ &= 500 \end{align*}

Checking in the formula again:

\begin{align*} Keyspace_{Absolute} = &10,000,000,000,000 \\ Keyspace_{Available} = &10,950,000,000 \\ Keyspace_{Effective} = &valid\_dates \times serial\_numbers \\ & \times citizenship\_status \times A\_column \\ = &36,500 \times 500 \times 1 \times 1 \\ = &18,250,000 \end{align*}

We’re now down to 18.25 million possible keys to check.

Python for effective keyspace

cardinality_of_birthdays = 100 * 365 # ignoring leap years for now
cardinality_of_serial_numbers = 500
cardinality_of_citizenship_states = 1
cardinality_of_the_a_digit = 1
total_number_of_email_matching_combinations = cardinality_of_birthdays * cardinality_of_serial_numbers * cardinality_of_citizenship_states * cardinality_of_the_a_digit
excluded_by_using_email_information = keyspace_available - total_number_of_email_matching_combinations
print(f"Number of numbers matching email: {total_number_of_email_matching_combinations:,}")
print(summary_to_brute_force(total_number_of_email_matching_combinations))

Number of numbers matching email: 18,250,000
At slow rate: 33.80 hours, or: 1.41 days
At optimised: 1.13 hours, or: 0.047 days

Even the naive, 1.41 days is really starting to look feasible, and with John the Ripper, we’re already doing it in little over an hour.

But what about the check digit?

Earlier we ignored the check number, since we can calculate it… but we were supplied it in the email.

We can use it to see if the ID is a potential match, and only check matching ones against the file.

Luhn format checkdigits use simple modulo 10 arithmetic.

This means only 10% of the generated IDs will be checked against the PDF password.

Python for checksum validated keyspace

using_checkdigit_exclusion = total_number_of_email_matching_combinations / 10
print(f"Entries matching email check digit: {using_checkdigit_exclusion:,.0f}")
print(summary_to_brute_force(using_checkdigit_exclusion))

Entries matching email check digit: 1,825,000
At slow rate: 3.38 hours, or: 0.14 days
At optimised: 0.11 hours, or: 0.005 days

Our ‘effective’ keyspace is now 1,825,000 entries.

So even with a naive implementation just in Python, we can do it in less than a day.

Age scoping

A friend pointed out that searching all ages between 0-100 is a bit pointless, so we could change that to be a range of 18-70?

Because the birthday field covers 100 years cleanly, we calculate the number of years we want to test.

keyspace_{Age Scoped} = keyspace_{effective} \times \frac{years\_to\_test}{100}

However, given the effective keyspace we’re already down to, the impact of age reduction feels less useful in this scenario, if you had less information this reduction could be more useful.

Python for age reduced keyspace

min_age = 18
maximum_age = 70
age_range = maximum_age - min_age
percentage_of_people_in_age_range = age_range / 100
total_number_in_age_range = using_checkdigit_exclusion * percentage_of_people_in_age_range
print(f"Number in valid age range: {total_number_in_age_range:,}")
print(summary_to_brute_force(total_number_in_age_range))

Number in valid age range: 949,000.0
At slow rate: 1.76 hours, or: 0.07 days
At optimised: 0.06 hours, or: 0.002 days

In summary

From an absolute key space of 11,000,000,000,000, we’ve excluded over 99.99999% of the possible numbers, and have only 949,000 to check against the file.

Python to generate table

import pandas as pd
data = [
            [ keyspace_absolute, 100],
            [ keyspace_available, 100 * keyspace_available / keyspace_absolute],
            [ total_number_of_email_matching_combinations, 100 * total_number_of_email_matching_combinations / keyspace_absolute],
            [ using_checkdigit_exclusion, 100 * using_checkdigit_exclusion / keyspace_absolute],
            [ total_number_in_age_range, 100 * total_number_in_age_range / keyspace_absolute],
      ]
df = pd.DataFrame(data, columns=["Count", "Percentage of Total"], index=["Absolute Keyspace", "Available Keyspace", "Email Keyspace", "Using Email checkdigit", "Limiting by Age"])
brute_force_label = f"Hours @ {test_rate_slow}/s"
brute_force_faster_label = f"Hours @ {test_rate_fast:,}/s"
df[brute_force_label] = df["Count"] / test_rate_slow / 60 / 60
df[brute_force_faster_label] = df["Count"] / test_rate_fast / 60 / 60
df.style.set_properties(**{'font-family': "Menlo, Consolas, Monospace"}) \
  .format(subset=["Count"], thousands=",", precision=0) \
  .format('{:.6f} %', subset=["Percentage of Total"]) \
  .format(thousands=",", precision=1, subset=[brute_force_label, brute_force_faster_label])

Set of potential ID Numbers	Size of set	Percentage of Absolute	Hours @ 150/s	Hours @ 4,500/s
Absolute Keyspace	10,000,000,000,000	100.000000%	18,518,518.5	617,284.0
Available Keyspace	10,950,000,000	0.109500%	20,277.8	675.9
Email Keyspace	18,250,000	0.000182%	33.8	1.1
Using Email Checkdigit	1,825,000	0.000018%	3.4	0.1
Limiting by Age	949,000	0.000009%	1.8	0.1

Visualisation

Graphing this is really hard, differences in scale make it really difficult to communicate.

Since I’m not a data-vis genius, this uses a log-scale.

Python to generate Graph

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data = df['Count'][::-1]
data.plot(kind='barh', logx=True, title="Keyspace Reduction\nnote the log-scale",ax=ax)
vals = ax.get_xticks()
ax.set_xticks(ax.get_xticks()[::2])
ax.set_xlabel("Keys in Keyspace")

Conclusions

This analysis shows that in common with recent attacks on the Tetra encryption system , if you’re not using all of the absolute keyspace, your protection is far weaker than may appear from a big number.

These national/structured IDs do not make good secrets: the structure inherently reduces the size of the effective keyspace, and makes it very easy to exclude ranges of people (by age or gender).

While phishing is a problem, and emails need to be/appear authentic – we need to use mechanisms to achieve this at the email level: SPF, DKIM, DMARC, BIMI. While imperfect, these are far better than including information directly related to the ID/information being protected.

In this scenario even with a naive implementations, it would be entirely feasible to brute-force this particular email/pdf combination, which would expose customer information.

Now I don’t know how valuable that information in the statement is, but I wonder if it be used as part of a social engineering attack?

A plea to companies: If I message asking you to stop send me statements, maybe stop?

Save Money, deploy IPv6 in your VPC

Getting past uncertainty or unwillingness to use IPv6 is actively costing you money.

I (mostly) have IPv6 deployed in my home network, alas a bug on my router currently prevents it working completely, but for years it’s been enabled, and mostly as a novelty to feel like I was ready for the grand future – but without really noticing any differences.

Last week however, I had a chat that made me realise that using IPv6 is now a cost-saving measure… and it’s maybe time to get over our resistance to do it by default.

The Joy of NAT

Typically you deploy an AWS VPC using internal IPv4 addresses, and (if you’re as allergic to avoidable self-managed service as I am), a managed NAT gateway.

As with nearly everything AWS, there’s an hourly cost, and also charges for the volume of data transferred.

In an AWS group I’m a member of, someone asked “We’re paying a lot for NAT, what can we do” – and I said “I mean, you could try the IPv6 Egress Gateway, but I dunno if the APIs you’re using support IPv6”.

The Egress only IPv6 gateway only charges ‘standard’ egress rates and it has no rental cost. I’ve been aware of its existence for years, but have never deployed it or had reason to.

I was expecting to be told “Only one of my APIs supports IPv6” but the person reported back “Actually nearly all my APIs are on IPv6, but I can’t deploy the IPv6 easily because of <reasons>”.

This was not what I was expecting: despite, remembering that for the last few years wherever I can deploy a dual-stack endpoint, that I have been…

So why haven’t we

There are a few reasons why we haven’t been deploying IPv6 more routinely, and we should try to change these:

IPv6 didn’t have an advantages. In most cases the IPv4 setup works ok… So what’s the point of adding it to a working setup…
The deployment tooling doesn’t support it – this was the case here, the person was using the lovely AWS CDK to deploy their VPC, and the current VPC ‘construct’ doesn’t support IPv6 easily
We like NAT’s Security by default: With NAT, your compute resources aren’t exposed to the internet, all that endpoints see are 1 or 2 shared IPs… There’s no way to accidentally open inbound connections, and that security by default is pleasing. Even though the IPv6 Egress Only gateway doesn’t allow inbound connections, you do feel more exposed, and so much of security is an annoying vibes type thing

Some steps to take

It’s sometimes easy to forget that we have run out of IPv4 addresses, and that we’re muddling through with Carrier Grade NAT and other annoying technologies that work, but make everyone’s lives just that little bit worse… But we can improve this incrementally, and anyone who’s worked with me knows I love making things better gradually.

Enable IPv6 on any managed services you use by default. If you’re running CDN hosts on CloudFront, or an API on API Gateway, those can all run IPv6, and are running it outside of your VPC. There is no security risk to you, and by creating those, you give other people benefit
Enable IPv6 in your VPC, but just on your public subnets. If adding IPv6 makes you nervous, start with ‘just’ doing it on your public subnets, and use it to given your load-balancers IPv6 addresses. This makes life better for other people, and gives learning for the next step
Enable IPv6 with Egress only gateway, for any subnets that have NAT: this is where you start saving money, as hopefully you’re NAT’ing less traffic and saving on those charges

Not every use-case will save money, I haven’t generally had the patterns of bandwidth usage that would benefit from this… but if you’re using a lot of NAT bandwidth, maybe it’s time to look at IPv6 as a potential cost-saving.

About “That” Prime Engineering article

Everyone who has worked with managed cloud services has experienced the moment when it made sense to move away from managed services.

Turns out, so does Prime Video.

Amazon Prime Video recently wrote about how changing away from managed services and writing a more integrated application saved them money. Despite being a few months old, this appeared to blow-up this week, and predictably has caused some cries of “SEE, SEE YOU SHOULD JUST RUN EVERYTHING YOURSELF”.

But to those of use who have been building on AWS (and other providers) for many years, it’s not a surprise, and we all have stories where we’ve done similar.

I say this as someone who is an annoying cheerleader for serverless compute and managed services, but despite that, I have home-rolled things, when it made sense.

How do you solve a problem

When you’re solving a problem, you look at what the managed services that you have available, considering factors like:

Your teams experience with the service
Limitations on the service, and what it was intended for, against what you’re doing
What quotas may apply that you hit
How the pricing model works

While pricing for managed-services is generally based on usage, sometimes specific costs will apply more to your workload, e.g. if you’re serving small images, you’ll be more impacted by the per-request cost than the bandwidth charges.

I would be surprised if an experienced architect hasn’t faced a situation where “Service X would be perfect for this, if only it didn’t have this restriction Y, or wasn’t priced for Z”.

My example

We’d built out a system that was performing 3 distinct processing steps on large files.

The system had built out incrementally, and we had the 3 steps on three different auto-scale groups, fed by queues.

While some of the requests could be processed from S3 as a stream, one task required downloading the file to a filesystem, and that download took time.

The users wanted to reduce the end-to-end processing time. Some of the tasks were predicated on passing prior steps, and so we didn’t want to make the steps parallel.

Attempt 1: “EFS looks good”

We used the ‘relatively’ new Elastic File System service from AWS… The first component downloaded the file, subsequent operations used this download.

This also had the advantage that the since the ‘smallest’ service was first, you paid for that download on the cheapest instance, and the more expensive instances didn’t have to download it.

We developed, deployed, and for the first morning it was really quick… until we discovered that we were using burst quota, and spent the afternoon rolling back.

Filesystem throughput was allocated based on the amount stored on the filesystem, but as this was a transient process, we didn’t replenish it quickly enough, and didn’t like the idea of just making large random files to earn it.

Now you can just pay for provisioned throughput, perhaps in a small part because of a conversation we had with the account managers.

Attempt 2: “Sod it, just combine them”

The processes varied in complexity, there was effectively a trivial, a medium, and a high complexity task… So the second solution we approached was combining all the tasks onto a single service… the computing power for the highest task would zoom through the other two tasks, and so we combined them into what I jokingly called “the microlith”.

We didn’t touch the other parts of the pipeline, or the database, they remained in other services, but combining the 3 steps worked.

What did we gain

The system was faster, and even more usefully to operators, more predictable.

Once processing had started you could tell, based on the file size, when the items would be ready…

Much like “lower p90 but higher maximum” feels better for user experience, this consistency was great.

What did we lose

Two of the three components had external dependancies, and this did mean this component was one of the less ‘safe’ to deploy, and while tests built up to defend against that… the impact of failed deploy was larger than you’d want.

In Conclusion

There are always times when breaking your patterns makes sense, the key is knowing what you’re both gaining and losing, and taking decisions for the right reasons at the right times.

Prime video refining an architecture to better meet scaling and cost models, making it less “Pure”, isn’t the gotcha against these services that some people would have you believe.

“Pure” design doesn’t win prizes.

Suitable design does.

Can I port my number out of Skype UK?

Here’s the one wild secret they don’t want you to know.

Please forgive me the horrible SEO title and URL…

I recently ported a number out of Skype UK to a much cheaper SIP VOIP provider. Skype served me well for a number I really just needed for compliance reasons… and it seemed obvious that I should actually transfer it to a supplier that would cost me about 12GBP per year, rather than 57 EUR as I was paying.

Number porting in the UK is a magical mystery tour, for all the times it “just works” other times you’ll feel like you’re playing the worst possible game of DND, with a horrible enemy and a disinterested Game Master.

Skype initially said “you don’t need to do anything special, just ask the new provider to port in” but turns out there was this one secret trick they didn’t think I’d want to know…

Skype expect your ‘surname’ as provided by your new supplier, to actually be your Skype username.

I discovered this after two failed porting attempts (for which my new super cheap provider charges me, which I understand, since the ongoing rental is so cheap).

I had asked Skype multiple times what to do: before the first port, after the first failure, after the second failure.

Most of these interactions were pretty infuriating “Porting is done by your new provider… it’s nothing to do with us” – despite that fact that My New Provider asks Skype to port, and Skype then says yes/no… They are pretty involved in the situation.

Skype wouldn’t tell the new provider what was wrong, beyond “the surname didn’t match” – This is entirely correct, because since Skype UK don’t seem to implement any porting code/verification scheme, for security they can’t and shouldn’t tell the port destination…

However they wouldn’t then tell me, via an authenticated channel, what they were expecting. Merely “you did it wrong try again… porting is nothing to do with us.”

It was only when I then went on another chat, talking about going to the ombudsman, and generally being a pain, did the porting team authorise the port, and revealed to the provider “yeah, the surname needs to be the username”.

This is not mentioned on their documentation. Or provided in live-chat because it’s seemingly a rare occurrence.

So if you’ve come to this post, having had problems porting out, try again telling your new provider that your ‘surname’ is in fact your username.

2023 Comms Resolutions

What things can you do in 2023 to make you communications more efficient and considerate in the world.

I don’t really like New Years resolutions for reasons beyond the scope of this post.

This year however, am going to try and make a few changes to how I communicate, in work and otherwise.

“No Hello”

No Hello on instant messaging.

I hate being on the end of the Dangling Hello, and the 15 minutes of massively predicting what the person is wanting. But I still find it very hard to bundle in all up in the first message.

Equally, 4 notifications in quick succession can feel like literal torture.

You can still ask how people are doing, but you can just include that upfront, in a single message.

Hello X, hope you’re good, can you tell me what’s going on with TICKET-123
Me, Slack, This Year

Priority Tagging, ideally lower

Low Priority exists as well as High Priority on emails.

Flagging a gossipy/catchup IM as such in the opening.

Clarify & Summarise

The discomfort at being That Guy who pastes back the summary of what you agreed is less than the pain when you discover that you weren’t all sharing understanding.

When half the team thought “advance by 2 seconds” meant delete 2 seconds, and the other half thought add 2 seconds..

Always a default

When arranging things, I’m going to offer a default, always.

“I’m free all day” vs “I’m free all day, how about 11”

Make it easier to say “Great” done.

Stick to Core Hours

I’m a freelancer, I work self-defined hours… but that’s not mine to share with others.

While it’s useful for me to get thoughts out of my head into an email, that doesn’t mean I need to get them forced onto other people…

If I’m sending an email, I’ll set it to send later
If it’s an IM, I’ll set Slackbot to remind me or maybe the person, during the next working day

In Conclusion

We all drowning in a sea of notifications, if you can make yours just a little better, you make it easier for people around you to help you.

Smartspeaker listening is massively up, but sure, delaying podcasts will help drive adoption

The BBC really need to add more Siri intents to Sounds to enable smart speaker listening.

Rajar data continues to show that Smart speaker listening is on the up.

As someone thoroughly locked into the Apple ecosystem, until recently I’ve not been able to easily ask to BBC Radio services through my HomePods. I had to follow some Reddit posts to install shortcuts “hey Siri play Bbc 6 music” and suddenly I find myself listening to more BBC Radio as a consequence.

Previously I think you had to listen to stations ‘enough’ for Siri to recognise the activity, then it could be a suggested shortcut. It was a bit ugly, and down to how intents originally worked.

The Siri APIs have got more developed over time, and now it is possible to do this in a way that doesn’t require upfront declaration.

It’s funny then, that given the choice of “adding a play with Siri intent to BBC Sounds” or “delaying podcast release to open platforms”, that Auntie chose the latter…

This has the strange result:

The BBC’s first party actions have me listening to less BBC Podcast Audio – why listen to a topical podcast 4 weeks later, and having to remember to check BBC Sounds doesn’t match my workflow
Actions by a Third-Party, have me listening to more BBC Live audio

I know there are always backlogs, but Siri intents have been around for a while now…